From marcello at perathoner.de Tue Sep 1 04:44:33 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 01 Sep 2009 13:44:33 +0200 Subject: [gutvol-d] Re: Grosly Broken browser or wiki In-Reply-To: References: Message-ID: <4A9D0921.7040700@perathoner.de> Greg Weeks wrote: > > Something is messed up. Can someone undo the last edit I just did for > the Science fiction bookshelf wiki page. It wiped it clean. > ibiblio is blocking posts longer than 64K. Even the older edits are longer than that so I can't restore them from here. I have contacted ibiblio to remove this limitation for pg. Meanwhile nothing is lost. We have all edits in the history. From marcello at perathoner.de Tue Sep 1 09:31:58 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 01 Sep 2009 18:31:58 +0200 Subject: [gutvol-d] Re: Grosly Broken browser or wiki In-Reply-To: <4A9D0921.7040700@perathoner.de> References: <4A9D0921.7040700@perathoner.de> Message-ID: <4A9D4C7E.3030102@perathoner.de> Marcello Perathoner wrote: > Greg Weeks wrote: >> >> Something is messed up. Can someone undo the last edit I just did for >> the Science fiction bookshelf wiki page. It wiped it clean. >> > > ibiblio is blocking posts longer than 64K. > > Even the older edits are longer than that so I can't restore them from > here. > > I have contacted ibiblio to remove this limitation for pg. > > Meanwhile nothing is lost. We have all edits in the history. Works now. Maybe we should split that page into smaller chunks: http://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)/A-L http://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)/M-Z or even more chunks. From Bowerbird at aol.com Thu Sep 3 11:36:30 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 3 Sep 2009 14:36:30 EDT Subject: [gutvol-d] Re: =?utf-8?q?Everyone_Wants_a_Kindle=E2=80=93For_=2450?= Message-ID: it ends up that people would like for the kindle to cost $50. somebody should tell david rothman about this. > http://mediamemo.allthingsd.com/20090903/study-everyone-wants-a-kindle-for-50/ kindle-shmindle. i want the apple itablet to cost $50... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Sep 4 13:55:05 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Sep 2009 16:55:05 EDT Subject: [gutvol-d] keeping up with cory Message-ID: if you haven't kept up with cory lately, he does a nice little history review here: > http://www.locusmag.com/Perspectives/2009/09/cory-doctorow-special-pleading.html money quote: > I don't give away downloads because I'm just a swell guy -- > I do it because I'm a self-employed entrepreneur who > needs to make as much as he can to support his family. in other words, a free online copy doesn't _cost_ him money, it _makes_ him money. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From hart at pobox.com Sun Sep 6 07:31:49 2009 From: hart at pobox.com (Michael S. Hart) Date: Sun, 6 Sep 2009 07:31:49 -0700 (PDT) Subject: [gutvol-d] !@! An urgent appeal to all Canadian supporters of Project Gutenberg Message-ID: Please forward as you feel appropriate. Michael S. 
Hart Founder Project Gutenberg ---------- Forwarded message ---------- Date: Sun, 6 Sep 2009 01:03:59 -0700 (PDT) From: Mark Akrigg To: Michael Hart Cc: hart at pglaf.org Subject: An urgent appeal to all Canadian supporters of Project Gutenberg Dear friends: I am the founder of Project Gutenberg Canada, and would like to make a special appeal to Canadian supporters of Project Gutenberg. Our government is sponsoring a Copyright Consultation on future changes to the Copyright Act. This is an unprecedented request from the government for the people of Canada to express their views on copyright law. Please consider making your personal submission. In the submission which I made on behalf of Project Gutenberg Canada, I made the following five recommendations: 1. A "Safe Harbour" provision for works more than 75 years old where the life dates of the authors are not known 2. No extensions of copyright durations 3. Explicit assignment to the Public Domain of those photographs that were in the Public Domain in 1997 4. 75 year copyright for works with more than 15 authors 5. Enhanced protection of the Public Domain You can read the full PG Canada submission at http://www.ic.gc.ca/eic/site/008.nsf/eng/01390.html Your own submission should be in your own words, and can be quite short. We don't want to bury the government in spam, and truly individual submissions will have the greatest effect. There is no need to precisely mirror the recommendations I made. You will find the main Copyright Consultations page here: http://copyright.econsultation.ca/ You will find information on how to email your submission here: http://copyright.econsultation.ca/topics-sujets/show-montrer/18 You might also wish to send a copy of your submission to your Member of Parliament: http://www2.parl.gc.ca/Parlinfo/Compilations/HouseOfCommons/MemberByPostalCode.aspx?Menu=HOC The main Copyright Consultation page has information on how you can participate in the forums being conducted by the government on copyright issues, which naturally cover many issues which do not affect PG Canada, but do affect your life in other respects. The main thing is to make your submission sooner rather than later: the Copyright Consultation ends on September 13th. It appears possible that there will be a federal election in Canada this fall. Don't forget to tell your candidates that they are answerable to you when it comes to copyright law, and that you expect any future government to protect and promote the Public Domain. Thank you in advance for your help. And don't forget to make your submission! Dr. Mark Akrigg Founder, Project Gutenberg Canada From richfield at telkomsa.net Mon Sep 7 01:00:42 2009 From: richfield at telkomsa.net (Jon Richfield) Date: Mon, 07 Sep 2009 10:00:42 +0200 Subject: [gutvol-d] Re: !@! An urgent appeal to all Canadian supporters of Project Gutenberg In-Reply-To: References: Message-ID: <4AA4BDAA.4080306@telkomsa.net> Michael S. Hart wrote: > Please forward as you feel appropriate. > > OK, so I am un-Canadian. So they don't have to read it. I sent them this. They won't do it of course, but I think some countries should consider the principle (among others of course). ============================ To: Copyright Consultations Your initiative in consulting Canadians on copyright matters does Canada credit, especially during a period of world-wide confusion and bad-faith violation and manipulation of copyrights. 
I am an author of largely semitechnical material and a heavy user of published material in general and I hope that you will consider some of the following points during the consultations. There is no question of any one correspondent covering the entire field of course. At the end of this document I address your questions as they were presented on your web page. Please note that I have nothing to say that specifically addresses anything but reading matter and illustrations, whether in electronic, printed or written form. Music, films and the like are outside my line of intimate involvement. To begin with we should understand that the entire matter is one of resolution of conflicts of interest. Realising this does not make the question simple, but trying to resolve it without clearly understanding that point would be simply futile, and only the lawyers would profit. At any point the question should be: "Whose interests would be furthered by such a measure?" If the answer is; "None in particular," then the measure should be considered no further. With due admiration for Mencken's "...there is always an easy solution to every human problem - neat, plausible, and wrong," I insist that the fewer and simpler the rules and regulations, the better. Let us consider some of the interests in possible conflict, in no definitive sequence. 1. The author or authors 2. The authors' estates, dependents, and heirs 3. Purchasers of copyrights 4. The publishers 5. The Canadian public who purchase the material 6. The Canadian public who use the material 7. The Canadian public image domestically and internationally 8. The International public who purchase the material 9. The International public who use the material 10. Posterity You will be well aware that there are emergent complications, both in good faith and very often in very bad faith, but as far as practical I am trying to stick to simple, commonsense lines of thought. 1. It is largely common cause that it is good that authors can publish and that they may exercise reasonable copyright. I do not consider complications such as authorship under contract or employment. 2. It similarly is good that an author that serves the public's desires suitably should be able to do so at sufficient profit to make it worth his own while and for adequate benefit to his dependents. 3. In spite of certain parties' idealistic objections, there is no practical basis apart from normal taxation, for limiting an author's legitimate profit from his work. If he or his publishers become billionaires from a book, then so be it. 4. It is in the public interest that an author's work be made available and that an author's productivity be nurtured for as long as public interests underwrite the published works through purchase or sponsorship or whatever arrangement suits the relevant parties. 5. There is no cogent basis for nominating any particular time limit to the copyright. 75 years after a work or 50 years after the author's death or the like are simply thumb-sucks at figures that suited particular parties, or were as long as they thought they could get away with. For most books they are too long by far, for a few they are probably too short. 6. It is a matter of the mildest concern one way or the other how long a book stays in copyright as long as it is sufficiently widely available in sufficient numbers and at reasonable cost if there is public demand. 
Publishing one copy a year in central Greenland at a price of a million dollars each as a bad-faith legalistic means of preventing public access would not meet the case. 7. Conversely, there are thousands of books that seem unlikely to get back into commercial print again, but are unsung classics. I could mention quite a few off my own shelves, such as "A Sailor's Life" by de Hartog, the autobiographical works of Alexander King, "Nature is your Guide" by Gatty, "Short History of the Art of Distillation" by Forbes, and a number of others that I do not wish to check for being in print at present. Some are textbooks of great value or primary documentation of events of great interest, but without commercial appeal. Such books often are doomed because no commercial publisher in his right mind would touch them, but by the time that they are out of copyright even the libraries and second-hand shops will have pulped their copies. They are of no benefit to any of the categories of interests that I listed above. Consider "Mr Belloc Objects" by Wells; it went out of print immediately after being published in 1926 and I am not even sure of its status today. However, it is one of the greatest gems of polemics in the history of science, and if certain specially interested parties had not scanned it in, it might have been lost by now. His "Science of Life" might well follow. At the same time, as long as works in that twilight zone are technically in copyright, projects such as Gutenberg will not touch them. 8. Any regulation that could be dispensed with without injustice, or could be substituted by a simpler or more self-regulatory convention is an imposition on both state and public and should be expunged or avoided. 9. The following scheme should accommodate or alleviate most of the foregoing considerations. 1. Copyright restrictions should apply according to some such scheme as those currently applicable. The exact terms and periods are not of major concern to this discussion. 2. As long as the product remains in print and reasonably available to the public through normal commercial channels etc and no other cogent objection can be raised, there need be little material change to the arrangements. 3. However, at any time after publication, any interested party could apply to some central national authority for non-exclusive copyright. He would have to give appropriate reasons why this should be granted. Such reasons would be of two basic types, firstly negative: lack of reasonable objections from interested parties. Examples of reasonable objections might include: the author might object to his publication being re-issued because of regret that he ever had published it. That would be valid. Conversely, the author might have no objection, but the publisher might wish to quash the book for competitive or personal reasons. That would not be valid. I cannot give a ranked list of negative considerations that the authority might consider, but it might be such things as that the author and family were deceased, that the book was out of print and that the former publishers had expressed lack of interest in re-commencing publication etc. Positive reasons might the public interest. A niche group might think the book of crucial value, but it might not at the present time be available. The appellant's own commercial interests would obviously not figure as strong arguments, and the author or his assignees would have claim for reasonable royalties. 4. 
If the arguments for allocating non-exclusive copyright were seen as adequate, then the original copyright holders would be notified if possible, and given a reasonable period to respond (perhaps half a year?) and if they did not respond, the copyright would not be ceded, but would be extended to the appellant, possibly with certain restrictions fitting the case. 5. The copyright, if granted to an appellant, would be non-exclusive; anyone else could concurrently ask for similar or different rights on the same or different grounds, and they might or might not succeed. 6. Any such copyright would remain contingent on no valid objection emerging subsequent to its being granted during the normal period of copyright. There would explicitly be no assurance that the appellant either could rely on no one else being granted a similar copyright. Also, the copyright might be withdrawn (without penalty, but also without compensation) if the original copyright holder subsequently produced adequate reasons for regaining exclusive copyright. The questions presented in the invitation to respond were as follows: 1. How do Canada?s copyright laws affect you? How should existing laws be modernized? As long as Canada is a signatory to international copyright conventions, including those that constrain the general use of material out of print, but still within copyright, everyone, including myself, suffers pointless loss of access to valuable material. (Of course, an even larger volume of total rubbish gets lost as well, but none of my suggestions aggravate its retention!) 2. Based on Canadian values and interests, how should copyright changes be made in order to withstand the test of time This is a little vague on two counts. Canadian values in context might at a guess include dignity, practicality, and fairness to all parties. The foregoing proposal seems to me to cover those. I should hope that the values would not assume slavishly unthinking adherence to traditional ways of doing things or to the NIH syndrome. As for Canadian interests, the scheme should entail no penalty whatever on any author or good-faith publisher, but should enable any party in Canada to avail themselves of resources that currently are being wasted pointlessly. Test of time? That is always hard to say antecedent to the test, but any scheme that puts the incentive to act constructively in the hands of the interested party, and permits correction in the event of error or justified objection, should not readily attract long-term resentment or annulment. 3. What sorts of copyright changes do you believe would best foster innovation and creativity in Canada? As detailed above. It would leave creative individuals (authors etc) with absolutely no reduction of their rights and incentives (they need never pay a lawyer to say "no" on their behalf, or (re)commence publication and circulation within a reasonable time, or whatever similar action might prove appropriate), but it also would enable users among the public to avail themselves of valuable works that otherwise would go to waste. If anything, they might profit from extra royalties. Possibly one also might wish to give attention to questions of unreasonable extensions of copyright in the hands of non-creators. Consider the case of the alleged behaviour of the copyright holders of "Gone With The Wind" in the US. 4. What sorts of copyright changes do you believe would best foster competition and investment in Canada? 
Something along the lines of the foregoing suggestion, calculated as it is, not only to increase access to desirable works by rescuing them from stagnation and unfair competition, but also increasing their returns for the author by increasing the scope for keeping them in print, might well attract foreign authors to print their works in Canada instead of in more hidebound countries that do less to promote publication. 5. What kinds of changes would best position Canada as a leader in the global, digital economy? I assume this refers to the current context only? After all, I am no economist! Canada is already a major leader in such fields. It is important to maintain flexibility, rather than to confuse rigidity with high standards. The most important thing is to ensure that laws accommodate the need to reward good faith and punish bad faith. A hypothetical illustration in the current international situation might be that whereas there need be no ceiling to the bonus that an executive of an enterprise could accept in the event of his delivering as contracted, it would have to be balanced by an ppropriately matching penalty in the event of non-delivery. A company, or even a third party should be able to invoke something of the type. Another principle should be very rapid turnover in court cases. There should be no limit to the value of damages that could be handled in what are currently the "small claims courts" or ombudsmen. Instead the assumption should be that their verdicts are correct, rapid, informal, and paid for by the state. Anyone who thinks that he has been hard done by in such a verdict should be informed of the basis of the decision before he appeals. If he then appeals, it gets passed on together with the details of the decision and objections to a supervisory and re-evaluatory committee that must report back within say 24 hours. If anyone still is dissatisfied he has recourse to the more ponderous mechanisms of the courts. The appellant then pays for everything, lawyers on both sides etc. Only if he then wins does he get his investment back. In general, keep things constructive, keep them brief, and concentrate on visible good faith and visible good sense. But that wasn't the sort of question I was expecting. Did I misunderstand the intent? Thank you for your attention. Feel welcome to contact me if in this hurried and incoherent note I left anything that seemed interesting but obscure. Jon Richfield ======================= Comments, up to and including horrified shrieks or bored yawns, as anyone prefers. Jon From Bowerbird at aol.com Tue Sep 8 02:30:36 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Sep 2009 05:30:36 EDT Subject: [gutvol-d] labor day -- working for peace Message-ID: the rev. carl kabat has spent over 10 years in jail as "punishment" for his _symbolic_ protests against nuclear weapons. > http://www.nytimes.com/2009/09/07/us/07activist.html what is wrong with our justice system? what is wrong with us? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From bzg at altern.org Mon Sep 7 20:36:00 2009 From: bzg at altern.org (Bastien) Date: Tue, 08 Sep 2009 11:36:00 +0800 Subject: [gutvol-d] Re: labor day -- working for peace In-Reply-To: (Bowerbird@aol.com's message of "Tue, 8 Sep 2009 05:30:36 EDT") References: Message-ID: <873a6yds67.fsf@bzg.ath.cx> Bowerbird at aol.com writes: > the rev. carl kabat has spent over 10 years in jail > as "punishment" for his _symbolic_ protests against > nuclear weapons. 
> >> http://www.nytimes.com/2009/09/07/us/07activist.html > > what is wrong with our justice system? > > what is wrong with us? Using uppercase letters in not completely useless... -- Bastien From pterandon at gmail.com Tue Sep 8 03:43:00 2009 From: pterandon at gmail.com (Greg M. Johnson) Date: Tue, 8 Sep 2009 06:43:00 -0400 Subject: [gutvol-d] What is the intended use of TXT format-- why line breaks? Message-ID: Hi. I have looked at PG's books in both HTML and TXT formats, on several different devices, from large-screen laptops to netbooks to an Ipod Touch. In just about every possible scenario, I had the line breaks creating an irregular right margin down the screen that made for unpleasant reading. I also tried taking one of the raw TXT files to make "my own" HTML file, and was tripped up by the line breaks. In order to prevent me from making the suggestion of changing the whole collection, can someone tell me why that number of characters on the screen was chosen? -- Greg M. Johnson http://pterandon.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From ricardofdiogo at gmail.com Tue Sep 8 07:29:45 2009 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Tue, 8 Sep 2009 15:29:45 +0100 Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line breaks? In-Reply-To: References: Message-ID: <9c6138c50909080729j1e3f90c7ha7f0cf0da9582ff8@mail.gmail.com> 2009/9/8 Greg M. Johnson > > Hi. Hi Greg M. > I have looked at PG's books in both HTML and TXT formats, on several different devices, from large-screen laptops to netbooks to an Ipod Touch. > > In just about every possible scenario, I had the line breaks creating an irregular right margin down the screen that made for unpleasant reading.? I also tried taking one of the raw TXT files to make "my own" HTML file, and was tripped up by the line breaks. > You can visit: http://www.gutenberg.org/wiki/Gutenberg:Readers%27_FAQ#R.30._When_I_print_out_the_text_file.2C_each_line_runs_over_the_edge_of_the_page_and_looks_bad. First, all paragraphs and separate lines should be separated by two HRs, so that you can see one blank line between them. Where they aren't, as in the case of a table of contents or lines of verse, add the extra HRs to make them so. Replace All occurrences of two HRs with some nonsense character or string that doesn't exist in the text, like ~$~. Replace All remaining HRs with a space. Replace your inserted string ~$~ with one HR. > In order to prevent me from making the suggestion of changing the whole collection, can someone tell me why that number of characters on the screen was chosen? > > You can visit: http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#About_the_formatting_of_a_text_file The idea is: using a pure text format, with a number of lines per page that can readable in most computers and preserved for the future to come. Ricardo F. Diogo From ricardofdiogo at gmail.com Tue Sep 8 07:31:14 2009 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Tue, 8 Sep 2009 15:31:14 +0100 Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line breaks? In-Reply-To: <9c6138c50909080729j1e3f90c7ha7f0cf0da9582ff8@mail.gmail.com> References: <9c6138c50909080729j1e3f90c7ha7f0cf0da9582ff8@mail.gmail.com> Message-ID: <9c6138c50909080731l396d32b9u17a88c3e2dfed472@mail.gmail.com> (Of course, in my last message I meant "characters per line", not "lines per page". Ricardo F. 
Diogo) From Bowerbird at aol.com Tue Sep 8 09:22:23 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Sep 2009 12:22:23 EDT Subject: [gutvol-d] Re: labor day -- working for peace Message-ID: bastien said: > Using uppercase letters in not completely useless... what is wrong with us americans? and what is wrong with our justice system? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From lee at novomail.net Tue Sep 8 09:41:40 2009 From: lee at novomail.net (Lee Passey) Date: Tue, 08 Sep 2009 10:41:40 -0600 Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line breaks? In-Reply-To: References: Message-ID: <4AA68944.2040108@novomail.net> Greg M. Johnson wrote: > In order to prevent me from making the suggestion of changing the whole > collection, can someone tell me why that number of characters on the > screen was chosen? In 1985 virtually all interactions with computers were performed via "smart terminals," predominately the DEC VT-52 and VT-100, which presented only text in an 80 x 25 array, that is, 25 lines each having at most 80 characters. The characters could be highlighted by reversing the electron output on the CRT (i.e. using 'on' instead of 'off', and vice-versa) but any other manipulation of the font, such a italic, bolding, or even a different font, was simply not possible. Even most personal computers of that day used VT-100 emulation. At that same time, we were being taught in typing class that left margins should be 66 characters; the bell would be set at 60, at which point the typist needed to decide whether the current word would fit in the 66 character limit, or whether it needed to be hyphenated. In 1985 the principals at Project Gutenberg did not want to deal with hyphenation, so no words were hyphenated. The current line length of Project Gutenberg files was designed so no word in unhyphenated form would ever cause a line to exceed 80 characters and wrap to a new line on a typical 1985-era smart terminal. In most ways, Project Gutenberg has not progressed beyond 1985. From hart at pobox.com Tue Sep 8 13:55:30 2009 From: hart at pobox.com (Michael S. Hart) Date: Tue, 8 Sep 2009 13:55:30 -0700 (PDT) Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line breaks? In-Reply-To: <4AA68944.2040108@novomail.net> References: <4AA68944.2040108@novomail.net> Message-ID: Too many search engines fail when words are hyphenated. There are all sorts of ways to remove hard returns in one second. It takes more time to complain than to actually find one of these and use it. . . . In many ways complainers have not evolved past Medieval Times. On Tue, 8 Sep 2009, Lee Passey wrote: > Greg M. Johnson wrote: > > > In order to prevent me from making the suggestion of changing the whole > > collection, can someone tell me why that number of characters on the screen > > was chosen? > > In 1985 virtually all interactions with computers were performed via "smart > terminals," predominately the DEC VT-52 and VT-100, which presented only text > in an 80 x 25 array, that is, 25 lines each having at most 80 characters. The > characters could be highlighted by reversing the electron output on the CRT > (i.e. using 'on' instead of 'off', and vice-versa) but any other manipulation > of the font, such a italic, bolding, or even a different font, was simply not > possible. Even most personal computers of that day used VT-100 emulation. 
> > At that same time, we were being taught in typing class that left margins > should be 66 characters; the bell would be set at 60, at which point the > typist needed to decide whether the current word would fit in the 66 character > limit, or whether it needed to be hyphenated. > > In 1985 the principals at Project Gutenberg did not want to deal with > hyphenation, so no words were hyphenated. The current line length of Project > Gutenberg files was designed so no word in unhyphenated form would ever cause > a line to exceed 80 characters and wrap to a new line on a typical 1985-era > smart terminal. > > In most ways, Project Gutenberg has not progressed beyond 1985. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From marcello at perathoner.de Tue Sep 8 15:07:06 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 09 Sep 2009 00:07:06 +0200 Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line breaks? In-Reply-To: References: <4AA68944.2040108@novomail.net> Message-ID: <4AA6D58A.4070700@perathoner.de> Michael S. Hart wrote: > There are all sorts of ways to remove hard returns in one second. But no way to decide which ones to drop and which one to keep. > It takes more time to complain than to actually find one of these > and use it. . . . Actually nobody has yet come up with a satisfactory solution to this problem. > In many ways complainers have not evolved past Medieval Times. Here we go again. Blaming your customers is still cheaper than fixing your bugs. I guess that's your only way out if you can't find a single argument to uphold the boneheaded plain text format that PG is still producing. And you can't find one, because there ain't one. -- Marcello Perathoner webmaster at gutenberg.org From i30817 at gmail.com Tue Sep 8 15:27:49 2009 From: i30817 at gmail.com (Paulo Levi) Date: Tue, 8 Sep 2009 23:27:49 +0100 Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line breaks? In-Reply-To: <4AA6D58A.4070700@perathoner.de> References: <4AA68944.2040108@novomail.net> <4AA6D58A.4070700@perathoner.de> Message-ID: <212322090909081527m1ea8c8ddpec17b8709f853c64@mail.gmail.com> I actually agree with the complaints if you don't mind my input (no flaming please). I try to do some sort of correction for my ebook reader, but its very primitive (and breakable) if the first alphabetic character in the new line is uppercase, keep the line, otherwise join them. First i tried if the last character of the previous line before a alphanumeric is a punctuation, keep the line, otherwise join it, but hey, more false positives. The one i uses at least corrects normal errors (Noun names non-withstanding) while keeping things like Chapter headings mostly intact (except lowercase off course). They can't be both applied i think. If some has a better algorithm, please share hey? This is one of the reasons i prefer html formats. A space is a space is not dozens of spaces and \n is nothing at all and
<br/>
do you know how they are accomplishing that? they run a program that converts the plain-text format to an .html file. to put it in other words, the .html file is elicited from the plain-text file. i guess david and al want to "get the credit" for creating the .html files, which is fine. but if they really wanted to increase overall productivity, they'd turn the conversion routine loose, so any end-user could run it, without having to wait for david or al to get around to the file they want. moreover, with more people using the routine, chances are that it would be improved via open-source coding contributions, which would be cool. but remember, it's the plain-text file that puts all of this action in play... *** greg said: > Nobody has argued that text is the master format, or should be. that's bull-shit, pure and simple. i have argued -- at length, with good arguments, ones that nobody has been able to counter -- that the plain-text format is the master. the same conversion processes that enable eucalyptus to elicit beauty from the plain-text file and which enable whitewashers to elicit .html can be used to create any type of file-format we might want to elicit, from the kindle to .epub to .pdf to .rtf to .lit to the-next-big-thing... now, that's not to say that the current form of the plain-text files is good enough to do the job, because it's not. but that's simply because the "powers that be" haven't accepted the modifications i've suggested. but that's their stupidity, it's not an inherent weakness in the format... in summary... if you're not smart enough to see that i have won this particular debate, step right up into the circle and i will be happy to knock you out, again. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Wed Sep 9 05:12:57 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 09 Sep 2009 14:12:57 +0200 Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line breaks? In-Reply-To: References: <4AA68944.2040108@novomail.net> <4AA6D58A.4070700@perathoner.de> Message-ID: <4AA79BC9.3040003@perathoner.de> David A. Desrosiers wrote: > On Tue, Sep 8, 2009 at 6:07 PM, Marcello > Perathoner wrote: >> But no way to decide which ones to drop and which one to keep. > > Given the previous example, negative lookahead assertions seem to fit well here: > > s/(? or /\n(?!\n)/ for zero width and /\n[^\n]/ for width=1 and so on. > > Plenty of ways to skin that cat in most regex-capable languages. ROTFL! Apply that algorithm to Hamlet and see. See if you can come up with an algorithm that doesn't make mincemeat of the following small excerpt. The algorithm should at least: 1. Recognizes that "HAMLET, PRINCE OF DENMARK by William Shakespeare" is the title statement of the work. This should be marked up like:

    Hamlet, Prince of Denmark
    by William Shakespeare

and NOT:

    Hamlet, Prince of Denmark

    by William Shakespeare

2. Not wrap the list of persons proper, BUT wrap

    Lords, Ladies, Officers, Soldiers, Sailors, Messengers, and other Attendants.

3. Recognize that

    SCENE. Elsinore

is a stage direction, not the start of scene 1.

4. Recognize

    ACT I.

5. Recognize

    Scene I. Elsinore. A platform before the Castle.

(Even if it lacks spacing.) --- start excerpt from #1524 ---- HAMLET, PRINCE OF DENMARK by William Shakespeare PERSONS REPRESENTED. Claudius, King of Denmark. Hamlet, Son to the former, and Nephew to the present King. Polonius, Lord Chamberlain. Horatio, Friend to Hamlet. Laertes, Son to Polonius. Voltimand, Courtier. Cornelius, Courtier. Rosencrantz, Courtier. Guildenstern, Courtier. Osric, Courtier. A Gentleman, Courtier. A Priest. Marcellus, Officer. Bernardo, Officer. Francisco, a Soldier Reynaldo, Servant to Polonius. Players. Two Clowns, Grave-diggers. Fortinbras, Prince of Norway. A Captain. English Ambassadors. Ghost of Hamlet's Father. Gertrude, Queen of Denmark, and Mother of Hamlet. Ophelia, Daughter to Polonius. Lords, Ladies, Officers, Soldiers, Sailors, Messengers, and other Attendants. SCENE. Elsinore. ACT I. Scene I. Elsinore. A platform before the Castle. [Francisco at his post. Enter to him Bernardo.] Ber. Who's there? Fran. Nay, answer me: stand, and unfold yourself. Ber. Long live the king! Fran. Bernardo? Ber. He. Fran. You come most carefully upon your hour. --- end excerpt #1524 ---- -- Marcello Perathoner webmaster at gutenberg.org From desrod at gnu-designs.com Wed Sep 9 06:03:05 2009 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Wed, 9 Sep 2009 09:03:05 -0400 Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line breaks? In-Reply-To: <4AA79BC9.3040003@perathoner.de> References: <4AA68944.2040108@novomail.net> <4AA6D58A.4070700@perathoner.de> <4AA79BC9.3040003@perathoner.de> Message-ID: On Wed, Sep 9, 2009 at 8:12 AM, Marcello Perathoner wrote: > ROTFL! Apply that algorithm to Hamlet and see. > See if you can come up with an algorithm that doesn't make mincemeat of the > following small excerpt. The algorithm should at least: As you already know, parsing HTML is a much easier matter than parsing semi-freeflow text (which was the original poster's request). Also remember, I do this all the time for spiders we write for Plucker. I slice, I dice, and I make beautiful, automated works of art from the worst, most semantically-incorrect HTML out there. See some examples here: http://projects.plkr.org/ From hart at pobox.com Wed Sep 9 06:31:33 2009 From: hart at pobox.com (Michael S. Hart) Date: Wed, 9 Sep 2009 06:31:33 -0700 (PDT) Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line breaks? In-Reply-To: References: <4AA68944.2040108@novomail.net> <4AA6D58A.4070700@perathoner.de> <4AA79BC9.3040003@perathoner.de> Message-ID: Want the simple way? Try unzipping to the Apple text format. . . . On Wed, 9 Sep 2009, David A. Desrosiers wrote: > On Wed, Sep 9, 2009 at 8:12 AM, Marcello > Perathoner wrote: > > ROTFL! Apply that algorithm to Hamlet and see. > > > See if you can come up with an algorithm that doesn't make mincemeat of the > > following small excerpt. The algorithm should at least: > > As you already know, parsing HTML is a much easier matter than parsing > semi-freeflow text (which was the original poster's request). > > Also remember, I do this all the time for spiders we write for > Plucker. I slice, I dice, and I make beautiful, automated works of art > from the worst, most semantically-incorrect HTML out there. 
See some > examples here: > > http://projects.plkr.org/ > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From marcello at perathoner.de Wed Sep 9 06:45:28 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 09 Sep 2009 15:45:28 +0200 Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line breaks? In-Reply-To: References: <4AA68944.2040108@novomail.net> <4AA6D58A.4070700@perathoner.de> <4AA79BC9.3040003@perathoner.de> Message-ID: <4AA7B178.9010009@perathoner.de> David A. Desrosiers wrote: > On Wed, Sep 9, 2009 at 8:12 AM, Marcello > Perathoner wrote: >> ROTFL! Apply that algorithm to Hamlet and see. > >> See if you can come up with an algorithm that doesn't make mincemeat of the >> following small excerpt. The algorithm should at least: > > As you already know, parsing HTML is a much easier matter than parsing > semi-freeflow text (which was the original poster's request). Do you read a post before replying? That's exactly what I requested you to do: To parse a plain text version of Hamlet into wrapped and non-wrapped paragraphs. -- Marcello Perathoner webmaster at gutenberg.org From desrod at gnu-designs.com Wed Sep 9 08:20:55 2009 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Wed, 9 Sep 2009 11:20:55 -0400 Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line breaks? In-Reply-To: <4AA7B178.9010009@perathoner.de> References: <4AA68944.2040108@novomail.net> <4AA6D58A.4070700@perathoner.de> <4AA79BC9.3040003@perathoner.de> <4AA7B178.9010009@perathoner.de> Message-ID: On Wed, Sep 9, 2009 at 9:45 AM, Marcello Perathoner wrote: > Do you read a post before replying? Of course... do you? > That's exactly what I requested you to do: To parse a plain text version of > Hamlet into wrapped and non-wrapped paragraphs. You did? The following looks pretty much like HTML to me, not plain ASCII text that wraps at 70 columns (like the original poster who started this thread requested). > See if you can come up with an algorithm that doesn't make mincemeat of the > following small excerpt. The algorithm should at least: > > 1. Recognizes that "HAMLET, PRINCE OF DENMARK by William Shakespeare" is the > title statement of the work. This should be marked up like: > >
>
>     Hamlet, Prince of Denmark
>     by William Shakespeare
>
> and NOT:
>
>     Hamlet, Prince of Denmark
>
>     by William Shakespeare
>
> 2. Not wrap the list of persons proper, BUT wrap
>
>     Lords, Ladies, Officers, Soldiers, Sailors, Messengers, and other Attendants.
>
> 3. Recognize that
>
>     SCENE. Elsinore
>
> is a stage direction, not the start of scene 1.
>
> 4. Recognize
>
>     ACT I.
>
> 5. Recognize
>
>     Scene I. Elsinore. A platform before the Castle.
>
(Even > if it lacks spacing.) From desrod at gnu-designs.com Wed Sep 9 08:24:29 2009 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Wed, 9 Sep 2009 11:24:29 -0400 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: References: Message-ID: On Wed, Sep 9, 2009 at 5:46 AM, wrote: > read the reviews for the iphone e-book viewer-app called "eucalyptus". > you will see that it gets credit for making p.g. e-texts look beautiful... > eucalyptus uses the plain-text format; it elicits beauty from that format. And thanks to Apple, another compelling reason NOT to get an iPhone to read etexts: http://www.blog.montgomerie.net/whither-eucalyptus From hart at pobox.com Wed Sep 9 09:47:35 2009 From: hart at pobox.com (Michael S. Hart) Date: Wed, 9 Sep 2009 09:47:35 -0700 (PDT) Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line breaks? In-Reply-To: References: <4AA68944.2040108@novomail.net> <4AA6D58A.4070700@perathoner.de> <4AA79BC9.3040003@perathoner.de> <4AA7B178.9010009@perathoner.de> Message-ID: On Wed, 9 Sep 2009, David A. Desrosiers wrote: ... Now THAT is the plainest text message I've ever seen! From Bowerbird at aol.com Wed Sep 9 12:12:14 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Sep 2009 15:12:14 EDT Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) Message-ID: david said: > And thanks to Apple, another compelling reason don't care to get into the ring itself, do you david? so you point to a shiny distraction off to the side! we're talking about the plain-text format, and how it is the most useful format for eliciting beauty (and more)... we're talking about how so many people -- like you -- argued with me over a period of several years about this very point, and how i have emerged victorious... that's what we're talking about... -bowerbird p.s. and, since you mentioned it, the brouhaha with apple gave _tons_ of additional exposure to eucalyptus, so -- in the end -- it gave the program a tremendous lift. i'm gonna do everything i can to make sure that my apps are held up by apple the same way, to get public sympathy. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Wed Sep 9 12:50:16 2009 From: jimad at msn.com (James Adcock) Date: Wed, 9 Sep 2009 12:50:16 -0700 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: References: Message-ID: I don't see how one "elicits beauty" from something that isn't there. Plain text doesn't have enough power to encode even simple mainstream texts, which frequently include the use of italic, for example. Yes, one can fake it, but then its not plain text anymore. I'd like to see a format that at least allows unambiguous encoding of mainstream texts, capturing the author's intent. Yes, once again one can fake it using HTML, but HTML contains SO MANY other weaknesses in the other direction! If we had an unambiguous encoding which captures authors intent, then it would be easy to go the other direction and "throw away" author's intent when it doesn't fit into plain jane text mode. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Bowerbird at aol.com Wed Sep 9 17:15:11 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Sep 2009 20:15:11 EDT Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) Message-ID: jim said: > I don?t see how one ?elicits beauty? from something that isn?t there.? > Plain text doesn?t have enough power to encode even > simple mainstream texts, which frequently include > the use of italic, for example.? italics are indicated by surrounding _underscores_... > Yes, one can fake it, but then its not plain text anymore. you have an archaic and incorrect notion of "plain text"... ? > If we had an unambiguous encoding which captures authors intent, > then it would be easy to go the other direction and ?throw away? > author?s intent when it doesn?t fit into plain jane text mode. why would you want to "throw away" the author's intent? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From hart at pobox.com Wed Sep 9 22:15:59 2009 From: hart at pobox.com (Michael S. Hart) Date: Wed, 9 Sep 2009 22:15:59 -0700 (PDT) Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: References: Message-ID: On Wed, 9 Sep 2009, Bowerbird at aol.com wrote: > jim said: > >?? I don?t see how one ?elicits beauty? from something that isn?t there.? > >?? Plain text doesn?t have enough power to encode even > >?? simple mainstream texts, which frequently include > >?? the use of italic, for example.? > > italics are indicated by surrounding _underscores_... > > > >?? Yes, one can fake it, but then its not plain text anymore. > > you have an archaic and incorrect notion of "plain text"... ? > > > >?? If we had an unambiguous encoding which captures authors intent, > >?? then it would be easy to go the other direction and ?throw away? > >?? author?s intent when it doesn?t fit into plain jane text mode. > > why would you want to "throw away" the author's intent? > > -bowerbird > Most of the authors I have interviewed on this subject, perhaps all, told me they never wrote in italics, bold or underscore, that this is only a publisher artifact, nothing to do with "author's intent." Thanks!!! Michael S. Hart Founder Project Gutenberg Inventor of ebooks From user5013 at aol.com Thu Sep 10 02:45:09 2009 From: user5013 at aol.com (Christa & Jay Toser) Date: Thu, 10 Sep 2009 04:45:09 -0500 Subject: [gutvol-d] TXT format and hard line breaks Message-ID: I do not often post to this list. But the question is relevant. Unfortunately, I must state my preference for raw text files. And, to a great extent, I agree with Bowerbird. At least, those parts I can read. You need to know that I live in America. Born here. But I currently "surf" the "internet" with a PowerPC 6100/66. I use Mozilla ver. 3.0. Macintosh operating system 7.5.1. Dial-up is mostly 56K. Try downloading video with that set-up. In fact, try reading this user list with that set-up. I get SOME messages when I read directly. I get OTHER messages when I choose to read "raw source." And, when I go to my workplace, and read on their PC computers, I read more DIFFERENT messages. [Their internet connection is lots better. They download about 1/2 terabyte a day.] And yet, of the three different places that I can read this list, I NEVER get ALL the messages. Each and every one is different. Oh yes, there is overlap, but I'm not really sure if I have really gotten all the messages. Honestly! 
So, I think the original question was a two parter: 1) Why text only? 2) Why the hard line breaks? I must first apologize if I offend anyone by answering question 1). Text was considered universal to the English speaking world -- way back when Project Gutenberg started. This was at a time when Unicode would not exist for about two decades. I LOVE TEXT. As I just said above, I will not/can not/am not allowed to/ read all of your messages. Even if I go through three different setups, and two different servers, I am still not certain that I have read everything you have sent. I feel that I am being censored by the internet. It is truly my opinion that, if e-mail were just sent in TEXT, then I would know more of this world. Yes, a picture is worth a thousand words. No, I would rather read a thousand words than see a picture. Especially in this modern day, when everyone and their mother have a better way of showing data. Every country on the entire planet (that's what? 300+ countries?) they all have a new and better way to format text. Every different language must somehow show their data somehow JUST ABSOLUTELY CORRECT. Their standard is right. This standard is right. That standard is right. No, everything is wrong. Let's re-invent the wheel from scratch. No, it doesn't "look right." It has to be "correct." It is wrong if the text lines "break" at the "wrong" place. Errrm, got carried away there. In my opinion, I think the raw text of each book in Project Gutenberg, is the ultimate in how a book should be delivered. Again, I apologize if I have offended anyone on this list for writing my obvious opinion. 2) Why the hard line breaks? Partially, this was covered by (I think) Bowerbird. There was a time when there were no fonts. Specifically, there were no "variable-width" fonts. Way before the Macintosh existed, there was only one way to read text. And it was only one width per each character, and there were only 80 characters per line. Max. Period. And when Project Gutenberg was started, he set the standard at whatever existed at the time. Break it at never more than 80 characters -- and break it between words. No hyphenation. Now, this problem of hard line breaks is a legitimate problem. Now, several decades after Macintosh (and later, PC's); it is my considered opinion that there is no need for a hard line break. Even way-back-when, in the early days -- there was question of whether a hard line break was just a (line feed) or (carriage return, followed by line feed). Yep, there were format problems back before 1980. Now-a-days, with all the wonderful formatting which is available; in so many different fonts; in so many different platforms; with so many different programs; that can read so many different styles; well then -- what do we choose is right? **sigh** Above, I have described my computer system. I will tell you, that my computer system is more advanced than perhaps 2/3rds of the world. Most do not have the bandwidth for a .pdf. Or actually any kind of formatted book. Maybe, they have an hour per week at an internet cafe. At 10-12K speed. They don't care if the line breaks are wrong. They care if they can read the books. And pretty much throughout that world, they can only read the books, ONLY if simple 8-bit ASCII text exists. No one on Project Gutenberg, NO ONE, can guarantee a more universal format, nor a faster format to download, than text only. (Except perhaps 7-bit ASCII, [capital letters only]; or OCTAL; but that diverges.] 
My only recommendation in this debate is this: There is no longer a need for a hard line break at every 80 character line. However, I believe there is still a need for a hard line break between paragraphs. I believe the text versions of the books can be scanned for single or groups, and be removed. Double or should be maintained. And yes, I feel strongly this can be done to the ORIGINAL .txt files. It is my opinion that the technology of the world has truly advanced beyond the need for a hard line break at the end of every line. Paragraph breaks, yes. Line breaks, no. As to how this would translate into .pdf or Kindle? Gagg me with a spoon. I'm not there. Hope this helps, Jay Toser From pterandon at gmail.com Thu Sep 10 16:28:57 2009 From: pterandon at gmail.com (Greg M. Johnson) Date: Thu, 10 Sep 2009 19:28:57 -0400 Subject: [gutvol-d] In search of a more-vanilla vanilla TXT Message-ID: Jay Toser wrote: > 1) ... I LOVE TEXT ... Me too. My problem is that the HTML n.n (n's being really small numbers) format I think is more universally viewable TODAY than the 80-character-line TXT on more devices (my ipod touch and 12" wide laptop screen each giving different readability problems: wacky wraparound and too fine a print in Notepad, respectively). > 2) My only recommendation in this debate is this: There is no longer a need for a > hard line break at every 80 character line. However, I believe there is still a need for a > hard line break between paragraphs. Completely agreed. That was the gist of my original proposal. Another question is whether today's most primitive TXT-reading softwares now come with wraparound-- and by this I mean "terminal editors" like emacs and vi. Or what is the most primitive device in use today-- is it a 1980 Win 3.1 'puter, perhaps the itouch (in some regards)? Another idea is whether we could tolerate another format. If we've already got half a dozen, why not have another that is "non-80 plain text" (defined above by Jay). -- Greg M. Johnson http://pterandon.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From hart at pobox.com Thu Sep 10 18:32:48 2009 From: hart at pobox.com (Michael S. Hart) Date: Thu, 10 Sep 2009 18:32:48 -0700 (PDT) Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: Trouble reading on a 12" screen? I read on my 9" screen just fine. Perhaps a font or resolution adjustment might help? I was on my 9" for hours today, never noticed a problem, surfing, reading, I use Notebook there all the time with no problem, but I probably adjusted the font/resolution, etc., the first day I had it and never worried again. I do use $1 reading glasses with all my computers, I must admit. . . . Michael On Thu, 10 Sep 2009, Greg M. Johnson wrote: > Jay Toser wrote: > > > 1) ... I LOVE TEXT ... > > Me too.? My problem is that the HTML n.n (n's being really small numbers)? format I think is more universally viewable TODAY than the 80-character-line > TXT on more devices (my ipod touch and 12" wide laptop screen each giving different readability problems: wacky wraparound and too fine a print in > Notepad, respectively). > > > > 2) My only recommendation in this debate is this: There is no longer a need for a > > hard line break at every 80 character line. ?However, I believe there is still a need for a > > hard line break between paragraphs. > > Completely agreed.? That was the gist of my original proposal. 
> > Another question is whether today's most primitive TXT-reading softwares now come with wraparound-- and by this I mean "terminal editors" like emacs and > vi.? Or what is the most primitive device in use today-- is it a 1980 Win 3.1 'puter, perhaps the itouch (in some regards)? > > Another idea is whether we could tolerate another format. If we've already got half a dozen, why not have another that is "non-80 plain text" (defined > above by Jay). > > > > > -- > Greg M. Johnson > http://pterandon.blogspot.com > > From tb at baechler.net Fri Sep 11 00:43:51 2009 From: tb at baechler.net (Tony Baechler) Date: Fri, 11 Sep 2009 00:43:51 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: <4AA9FFB7.2040304@baechler.net> Hi, I know of at least two people who still use DOS regularly and at least one who uses a 486. Unfortunately, DOS doesn't handle very long lines as I know from personal experience. I would ask PG to please continue using the same text format with line breaks. Conversion of line endings can be done easily when unzipping the file or with any of several utilities on any OS. Before people tell me how DOS is old and no one should use it in their right mind, I would like to say that the people I know of simply can't afford anything else and in most cases lack the computer skills. Yes, Linux runs on a 486 but they don't want to learn a new OS. Also, they are blind. That in itself isn't relevant but a screen reader by itself costs at least $795 in most cases. Most blind people have a very small income and can't afford a new Windows computer. There are some free screen readers but they still require XP or better. With that said, for most people, long lines aren't a problem and I realize that PG can't please all of the people all of the time. Those same DOS users are also on dial-up. For various reasons, html viewing in DOS isn't practical. On 9/10/2009 4:28 PM, Greg M. Johnson wrote: > > 2) My only recommendation in this debate is this: There is no longer > a need for a > > hard line break at every 80 character line. However, I believe > there is still a need for a > > hard line break between paragraphs. > > Completely agreed. That was the gist of my original proposal. > > Another question is whether today's most primitive TXT-reading > softwares now come with wraparound-- and by this I mean "terminal > editors" like emacs and vi. Or what is the most primitive device in > use today-- is it a 1980 Win 3.1 'puter, perhaps the itouch (in some > regards)? > > Another idea is whether we could tolerate another format. If we've > already got half a dozen, why not have another that is "non-80 plain > text" (defined above by Jay). From Bowerbird at aol.com Fri Sep 11 01:44:38 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Sep 2009 04:44:38 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: greg said: > Or what is the most primitive device in use today the web-browser. a web-browser won't wrap the lines on a .txt file, so if the hard-returns were removed from p.g. .txt files, the lines would run off the screen of a web-browser. try it if you don't believe me. *** it's absolutely true that project gutenberg should have given users a tool that would remove the hard returns, and it should've done that years ago, but it's also true that the .txt files _should_ have hard-returns in them. 
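A minimal sketch of the kind of hard-return-removal tool being asked for here (Python 3; the blank-line and short-line heuristics are illustrative assumptions, not anything PG or DP actually ships):

    import re
    import sys

    def unwrap(text, short_line=45):
        """Join hard-wrapped lines inside each paragraph.

        A blank line is kept as a paragraph break.  Blocks containing
        indented lines, or where most lines are short, are left alone
        as a rough stand-in for verse, tables and address blocks.
        """
        out = []
        for block in re.split(r'\n\s*\n', text):
            lines = block.split('\n')
            preformatted = (
                any(ln.startswith((' ', '\t')) for ln in lines) or
                sum(len(ln) < short_line for ln in lines) > len(lines) // 2
            )
            out.append(block if preformatted
                       else ' '.join(ln.strip() for ln in lines))
        return '\n\n'.join(out)

    if __name__ == '__main__':
        sys.stdout.write(unwrap(sys.stdin.read()))

Run as "python unwrap.py < book.txt > unwrapped.txt". The heuristics will misjudge some blocks, which is exactly the "which line breaks do you keep" problem argued over in the following messages.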
now, i'd suggest that those hard-returns should mimic the ones found in the print-books against which the text was proofed, but that won't help the books already done. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Sep 11 02:01:25 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Sep 2009 05:01:25 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: i said: > a web-browser won't wrap the lines on a .txt file, so > if the hard-returns were removed from p.g. .txt files, > the lines would run off the screen of a web-browser. i'm sorry. i was wrong. safari (at least) does wrap the lines. i'm not sure where i got that idea... at any rate, my apologies for the misinformation. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From traverso at posso.dm.unipi.it Fri Sep 11 02:34:20 2009 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Fri, 11 Sep 2009 11:34:20 +0200 (CEST) Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: (Bowerbird@aol.com) References: Message-ID: <20090911093420.1072310141@cardano.dm.unipi.it> >>>>> "Bowerbird" == Bowerbird writes: Bowerbird> i said: >> a web-browser won't wrap the lines on a .txt file, so if the >> hard-returns were removed from p.g. .txt files, the lines would >> run off the screen of a web-browser. Bowerbird> i'm sorry. i was wrong. safari (at least) does wrap Bowerbird> the lines. Bowerbird> i'm not sure where i got that idea... Bowerbird> at any rate, my apologies for the misinformation. Most browsers (IE, firefox, opera, konqueror) don't wrap, at least in the default configuration. Which makes sense, since wrapping may destroy information. I agree that PG should provide several custom TXT file formats. One might convert on the fly from one format to the other. Who cares to tune manually lines that are shorter than 55 characters? Still, this is one of the requirements, and one that often requires some time to achieve. One txt file in a sufficently rich encoding to allow correct representation is sufficient, everything else might be generated on the fly. And the best would be the format that carries most information: unicode, with the original line breaks as much as possible. Consider also that many HTML files are now provided with the original line breaks, and having the storage TXT file with the same lines would greatly simplify maintenance. Especially if one derives the txt file from the HTML automatically (or even better both from a common master). Carlo Traverso From hart at pobox.com Fri Sep 11 05:03:01 2009 From: hart at pobox.com (Michael S. Hart) Date: Fri, 11 Sep 2009 05:03:01 -0700 (PDT) Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: On Fri, 11 Sep 2009, Bowerbird at aol.com wrote: > greg said: > >?? Or what is the most primitive device in use today > > the web-browser. > > a web-browser won't wrap the lines on a .txt file, so > if the hard-returns were removed from p.g. .txt files, > the lines would run off the screen of a web-browser. > > try it if you don't believe me. > > *** > > it's absolutely true that project gutenberg should have > given users a tool that would remove the hard returns, > and it should've done that years ago, but it's also true > that the .txt files _should_ have hard-returns in them. 
>
> now, i'd suggest that those hard-returns should mimic > the ones found in the print-books against which the text > was proofed, but that won't help the books already done. > > -bowerbird > > We did do that years ago, and years before that. We also had very similar discussions years ago. I can't tell you how many times we posted info about different ways to remove hard returns, what they were, etc., etc., etc. As long as there are people who want it all done for them without any knowledge of how a computer works, this will be an issue, along with background color, font, font size, long or short pages or margins, refresh rates.... mh From marcello at perathoner.de Fri Sep 11 05:53:33 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri, 11 Sep 2009 14:53:33 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: <4AAA484D.8020705@perathoner.de> Michael S. Hart wrote: > I can't tell you how many times we posted info > about different ways to remove hard returns, > what they were, etc., etc., etc. Strawman. The problem is to decide which LF to retain. In Hamlet there are speeches that are verse and speeches that are prose. The LFs in verse need to be retained! There has never been posted any info on how to achieve this. OTOH there is much empirical evidence that the problem of restoring PG plain texts is intractable: Very many people have tried to write tools that convert the plain text mess into something usable. GutenMark, Munseys, Manybooks etc. come to mind. But when you download some of their machine-made repackages of PG you see that they didn't get very far. PG has a very high standard of accuracy for the words, thus an automatic conversion has to achieve the same high standard for the formatting. Unless somebody can provide this tool, much information has been lost. -- Marcello Perathoner webmaster at gutenberg.org From jimad at msn.com Fri Sep 11 08:26:46 2009 From: jimad at msn.com (Jim Adcock) Date: Fri, 11 Sep 2009 08:26:46 -0700 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: References: Message-ID: >why would you want to "throw away" the author's intent? I don't want to throw away author's intent. But the reality is, in many cases DP and PG do so. Leading and following underscores are not plain text. It is an encoding to signal to the reader that something is missing -- namely italics. One could have just as well -- or as badly -- used <i> and </i> as the signals to indicate to the reader that italics is missing. I don't doubt that eventually the reader can get used to what they're missing -- but why should they have to? If it were really that hard to much more closely follow author's intent then I could understand the trade-offs. But with today's technology it really wouldn't be hard to do much better. And again, if you *want* plain text then it's easy enough to go backwards and throw away the italic information, etc. From Bowerbird at aol.com Fri Sep 11 16:00:06 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Sep 2009 19:00:06 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: i said: > it's absolutely true that project gutenberg should have > given users a tool that would remove the hard returns, > and it should've done that years ago michael said: > We did do that years ago, and years before that. oh really? and just exactly where is that tool? 
-bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Sep 11 16:16:28 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Sep 2009 19:16:28 EDT Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) Message-ID: jim said: > Leading and following underscores are not plain text.? sure they are. indeed, the underscore even falls in the 7-bit range. so it's as plain-text as plain-text can be, and it has a long and glorious tradition of indicating emphasis. > It is an encoding to signal to the reader that > something is missing -- namely italics.? actually, i think of it as an indicator to the "rendering agent" -- a.k.a. the viewer-program -- that the surrounded text is to be displayed with emphasis. (which generally means italics.) > One could have just as well -- or as badly -- used > [i]and[/i] as the signals to indicate to the reader > that italics is missing.? i used square-brackets rather than angle-brackets in the quote, but i could have used angle-brackets just like you did, jim... and yes, sir, any of those will work. indeed, .html uses the angle-brackets, and many bulletin-board systems use the square-brackets. and this is fine, because they use those brackets as _markup_, with no intention that the brackets will actually be _seen_ by any human beings. and likewise, i don't intend my underscores to be seen by human beings. just like .html, or forum markup, i expect that a viewer-app will intercede and display the emphasis just as i had intended. however -- and this is a very big _however_ -- in the case that those underscores _are_ being seen by actual human beings, it's not really all that much of a problem, because underscores are relatively non-intrusive, and they seem to provide emphasis, which is why they developed -- spontaneously -- for that purpose. the brackets, on the other hand, are terribly intrusive, and only obliterate the text to be emphasized, rather than emphasize it. likewise, the other bracket commands all serve as _obstacles_ to a human being who happens to be reading the text, and even to those human beings who have to work with the text in other capacities, such as editing it. z.m.l., on the other hand, is zen. that's why light-markup systems are taking over the world now. > I don't doubt that eventually the reader can get used to > what they're missing -- but why should they have to? they shouldn't. that's why i have programmed the viewer-apps that ensure that people don't have to read z.m.l. in its raw form. > If it were really that hard to much more closely follow > author's intent then I could understand the trade-offs.? > But with today's technology it really wouldn't be hard > to do much better. i agree. we can do much better than what we have been handed. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Fri Sep 11 20:55:31 2009 From: jimad at msn.com (James Adcock) Date: Fri, 11 Sep 2009 20:55:31 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AAA484D.8020705@perathoner.de> References: <4AAA484D.8020705@perathoner.de> Message-ID: >PG has a very high standard of accuracy for the words, thus an automatic conversion has to achieve the same high standard for the formatting. I would be happy to start with if the same standard for the accuracy of punctuation was held as for the high standards expected of the words. 
Of course for poetry puncs and LF are basically the same issue. From jimad at msn.com Fri Sep 11 21:26:23 2009 From: jimad at msn.com (James Adcock) Date: Fri, 11 Sep 2009 21:26:23 -0700 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: References: Message-ID: jim said: > Leading and following underscores are not plain text. sure they are. indeed, the underscore even falls in the 7-bit range. so it's as plain-text as plain-text can be, and it has a long and glorious tradition of indicating emphasis. There are many long and inglorious methods of indicating emphasis in plain-text including *asterix* and SHOUTING and _underscore_ and <i></i> and [i][/i] and they all suffer from the same problem: They are all not what the author wrote, at least not as implemented by the typically concurrently existing publisher. Now say 100 years later PG says ignore those previous efforts we as the publisher of this day knows better than the original intent so we will substitute something else for what was actually printed. Now if someone really only has a 7-bit teletype to print their PG on, then I can understand this. I can also understand PG's desire to continue to support such teletypists [[I tried using one when I was in college which tells you how old I am but it kept overheating and burning out based on my demands]] What I don't understand is why PG continues to be wedded to plain-text as an *input* encoding format demanded of people submitting texts to PG. Plain-text is too constrained to do the job well. HTML is too ambiguous, and too ill-matched to books to do well. We need something else, something that CAN be correctly and automagically converted "correctly" to one or another formats including plain-text, and Unicode, and HTML, and mobi, etc. And something that allows the simple every day tasks of the encoder, including italics and m-dash and poetry, titles and chapters and subchapters, publisher info, dates, etc to be handled correctly and easily. PS: Bit curious which blind reader handles _the underscore "convention"_ correctly - I've not seen _that_ one! -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Sep 11 23:10:27 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 12 Sep 2009 02:10:27 EDT Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) Message-ID: jim said: > What I don't understand is why PG continues to be > wedded to plain-text as an *input* encoding format > demanded of people submitting texts to PG. well, if you _honestly_ "don't understand" the reason, jim, then i must say that you certainly aren't trying very hard... the plain-text format is the most valuable to people because it is the most pliable when it comes to reworking the content. > Plain-text is too constrained to do the job well. first you want to constrain the format to an archaic definition... then you want to complain about it because it's too constrained. that's disingenuous. > HTML is too ambiguous, and too ill-matched to books to do well. no, that's not the problem -- .html can do a fine job on books, for the most part, but the problem is that it's a pain to create. > We need something else, something that CAN be correctly > and automagically converted "correctly" to one or another formats > including plain-text, and Unicode, and HTML, and mobi, etc. that "something else" is z.m.l. 
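For what it is worth, the underscore convention is trivially machine-readable. A toy Python 3 sketch of promoting _underscore emphasis_ in plain text to HTML emphasis (purely illustrative; this is not z.m.l., Guiguts, or any converter PG actually uses):

    import html
    import re

    def emphasis_to_html(paragraph):
        # Escape first so stray < and & in the text stay literal,
        # then promote _..._ spans to <em>...</em>.
        escaped = html.escape(paragraph)
        return '<p>' + re.sub(r'_([^_]+)_', r'<em>\1</em>', escaped) + '</p>'

    print(emphasis_to_html('It was _not_ a dark and stormy night.'))
    # -> <p>It was <em>not</em> a dark and stormy night.</p>

Going the other way, from HTML emphasis back to underscores, is just as mechanical, which is the "easy enough to go backwards" point made above.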
> And something that allows the simple every day tasks > of the encoder, including italics and m-dash and poetry, > titles and chapters and subchapters, publisher info, dates, etc > to be handled correctly and easily. again, you're talking about z.m.l. but, you know, you can invent your own equivalent, if you like... > PS: Bit curious which blind reader handles > _the underscore "convention"_ correctly - > I've not seen _that_ one! i'll let tony answer that question. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Sat Sep 12 02:47:55 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 12 Sep 2009 11:47:55 +0200 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: References: Message-ID: <4AAB6E4B.4000004@perathoner.de> James Adcock wrote: > There are many long and inglorious methods of indicating emphasis in plain-text > including **asterix** and SHOUTING and _/underscore/_ and <i></i> and [i][/i] > and they all suffer from the same problem: They are all not what the author > wrote, at least not as implemented by the typically concurrently existing > publisher. No author wrote italics before word processors became available to the end user. They _underlined_ the passages that they wanted the publisher to highlight. The publisher then chose an appropriate way of highlighting: /italics/ or s p a c e o u t or SMALLCAPS. Mediaeval copyists usually rubricated passages they wished to highlight. > Now say 100 years later PG says ignore those previous efforts we as > the publisher of this day knows better than the original intent so we will > substitute something else for what was actually printed. So what? The brick-and-mortar publishers of yore ignored the previous efforts of the monastic scribes because it was too expensive to print twice with different inks. They also ignored the underlining of the author and substituted an artifact of their choosing. Also that artifact was largely a function of the cultural environment: italics or spaceout. > What I don't understand is why PG continues to be wedded to plain-text as an > **input** encoding format demanded of people submitting texts to PG. Nobody understands that. It is a waste of resources pure and simple. Consider that: * The bottleneck at DP is the post-processing stage. * The post-processor is burdened with the creation of one surplus txt file. * The whitewasher is burdened with one or more surplus txt files. * Every error needs to be fixed in more than one place (in html and up to three txt files, plus as many zips) * We could easily produce a (good enough) txt version from html on the fly with lynx in any encoding the user may want. -- Marcello Perathoner webmaster at gutenberg.org From sankarrukku at gmail.com Sat Sep 12 03:42:29 2009 From: sankarrukku at gmail.com (Sankar Viswanathan) Date: Sat, 12 Sep 2009 16:12:29 +0530 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: <4AAB6E4B.4000004@perathoner.de> References: <4AAB6E4B.4000004@perathoner.de> Message-ID: The final output from DP is a text. This is processed through Guiguts. Most of the Post Processors in DP use Guiguts for post processing. The html is generated from this text file. So no additional work is involved in producing a text file. Again there is no additional work in White Washing because of the text file. 
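As a concrete illustration of the "derive the text file from the HTML automatically" side of this argument: even the Python standard library can produce a serviceable wrapped-text dump of simple HTML, in the spirit of the lynx dump mentioned above (a sketch only; the tag list, width and sample markup are arbitrary choices, not a PG tool):

    import textwrap
    from html.parser import HTMLParser

    class TextDump(HTMLParser):
        # Minimal HTML-to-text dumper: block-level tags become paragraph breaks.
        BLOCK = {'p', 'div', 'br', 'h1', 'h2', 'h3', 'h4', 'li', 'tr'}

        def __init__(self):
            super().__init__()
            self.parts = []

        def handle_starttag(self, tag, attrs):
            if tag in self.BLOCK:
                self.parts.append('\n\n')

        def handle_data(self, data):
            self.parts.append(data)

        def dump(self, width=70):
            paragraphs = ''.join(self.parts).split('\n\n')
            wrapped = [textwrap.fill(' '.join(p.split()), width)
                       for p in paragraphs if p.strip()]
            return '\n\n'.join(wrapped)

    parser = TextDump()
    parser.feed('<h1>I</h1><p>It was a dark and stormy night; the rain '
                'fell in torrents.</p>')
    print(parser.dump(width=40))

Whether such machine output is "good enough" to replace a hand-finished text file is, of course, the very point in dispute in this thread.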
On Sat, Sep 12, 2009 at 3:17 PM, Marcello Perathoner wrote: > James Adcock wrote: > > There are many long and inglorious methods of indicating emphasis in >> plain-text including **asterix** and SHOUTING and _/underscore/_ and >> and [i][/i] and they all suffer from the same problem: They are all not what >> the author wrote, at least not as implemented by the typically concurrently >> existing publisher. >> > > No author wrote italics before word processors became available to the end > user. They _underlined_ the passages that they wanted the publisher to > highlight. The publisher then choose an appropriate way of highlighting: > /italics/ or s p a c e o u t or SMALLCAPS. > > Mediaeval copysts usually rubricated passages they wished to highlight. > > > Now say 100 years later PG says ignore those previous efforts we as the >> publisher of this day knows better than the original intent so we will >> substitute something else for what was actually printed. >> > > So what? The brick-and-mortar publishers of yore ignored the previous > efforts of the monastic scribes because it was too expensive to print twice > with different inks. > > They also ignored the underlining of the author and substituted an artifact > of their choosing. Also that artifact was largely a function of the cultural > environment: italics or spaceout. > > > What I don?t understand is why PG continues to be wedded to plain-text as >> an **input** encoding format demanded of people submitting texts to PG. >> > > Nobody understands that. It is a waste of resources pure and simple. > > Consider that: > > > * The bottleneck at DP is the post-processing stage. > > * The post-processor is burdened with the creation of one surplus txt file. > > * The whitewasher is burdened with one or more surplus txt files. > > * Every error needs to be fixed in more than one place (in html and up to > three txt files, plus as many zips) > > * We could easily produce a (good enough) txt version from html on the fly > with lynx in any encoding the user may want. > > > > > > > > -- > Marcello Perathoner > webmaster at gutenberg.org > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -- Sankar Service to Humanity is Service to God -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Sat Sep 12 05:04:22 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 12 Sep 2009 14:04:22 +0200 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: References: <4AAB6E4B.4000004@perathoner.de> Message-ID: <4AAB8E46.1080305@perathoner.de> Sankar Viswanathan wrote: > The final output from DP is a text. This is processed through Guiguts. Most of > the Post Processors in DP use Guiguts for post processing. The html is > generated from this text file. If this is true its all the more waste. If you output a text file from the OCR and later use a human to re-create HTML this is more work than letting the OCR output the HTML directly. And all this crooked workflow is needed because PG requires a txt file for hysterical reasons. No wonder Google is eating our lunch ... they know how to put software to work instead of people. > So no additional work is involved in producing a text file. Nice sophism. Additional work is required to produce the HTML file. So what? 
> Again there is no additional work in White Washing because of the text file. I don't believe you. Working 2 files (3, maybe 4) IS more work than working one file. Even if you just open the file to see if it is the right one, its work. -- Marcello Perathoner webmaster at gutenberg.org From pterandon at gmail.com Sat Sep 12 05:13:21 2009 From: pterandon at gmail.com (Greg M. Johnson) Date: Sat, 12 Sep 2009 08:13:21 -0400 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: Marcello wrote: >The problem is to decide which LF to retain. So you have to go to a program like Word, turn all instances of "^p^p" to "QQQQ", then delete "^p", then turn "QQQQ" back to "^p^p". This happened to work for me fairly well just now with "Ten Nights in a Bar Room". The novel itself looked okay but the license at the end had some spacing irregularites to it. However: i) There will be cases where folks are software limited. I cannot see anyone being able to do this on the Ipod Touch. I've tried to look at 80-TXT files a couple times from an Apple store in cases where there was no HTML version. This may be a silly example, but I think it's about making an impression on such cursory visitors to PG. ii) There will be cases where folks are skills limited. Would the stereotypical impoverished child in Honduras be able to do that? iii) What about Shakespeare? Michael wrote: > Trouble reading on a 12" screen? Yes. Why anyone ever came up with a 7.5 x 12.5 inch screen is beyond me, but you sort of have to choose a small pixel size to get some things in your workflow vertically all on the same page. And there are font sizes that are fine for reading things you never really need to **read**, like "File Edit View," and then there are font sizes which you'd want if you're forcing your eyes to actually read a whole book. Notepad might be fine for a short shopping list or work to-do list but not for an entire novel in monospace. Hence also wanting to redirect the viewing experience into an HTML browser with Ctrl + font-size-changing capability. (Edit: I just learned just now that Notepad has an option for changing font face!) Okay, I'll stop stirring the pot (if yall'd prefer I do), but here are two last ideas on this topic: i) If you are producing a book, *please* consider making an HTML version to be as important as the 80-TXT one, certainly more important than PDF, PUB, and MOBI. In my mind, the ones without HTML (and put the entire legalese at the front of the doc) are in some sense "lost to history" because they aren't nearly as readable. ii) Rather than curse the darkness, someone should light a candle. My response to my allegation of 80-TXT readability was to compile a DVD of 3850 books-- hopefully more books than any reasonable person would ever want to read in a lifetime-- all in ****unzipped**** HTML format-- structured with HTML which operates as I imagine the ideal book reading hand-held device ought (if I were ever to see one in operation). I've sent a workable draft to Michael; I'm now looking at squeezing in a mite more books and maybe setting up editor's picks. -- Greg M. Johnson http://pterandon.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Sat Sep 12 06:17:37 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 12 Sep 2009 15:17:37 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: <4AAB9F71.2030701@perathoner.de> Greg M. 
Johnson wrote: > i) If you are producing a book, *please* consider making an HTML version to be > as important as the 80-TXT one, certainly more important than PDF, PUB, and > MOBI. In my mind, the ones without HTML (and put the entire legalese at the > front of the doc) are in some sense "lost to history" because they aren't nearly > as readable. That could easily be done. We have to make HTML on the way to producing EPUB. So technically we just could spew out the HTML before packaging the EPUB. But I don't know if it *should* be done ... The problem is: Nobody has ever been able to generate even barely palatable HTML from PG TXT. For EPUB we can justify the ugly conversion because on most ebook readers and small screens ill-formatted EPUB is still better than TXT. But HTML is supposed to be viewed on browsers and big screens, so ill-formatted HTML will be worse than TXT. -- Marcello Perathoner webmaster at gutenberg.org From sankarrukku at gmail.com Sat Sep 12 08:31:04 2009 From: sankarrukku at gmail.com (Sankar Viswanathan) Date: Sat, 12 Sep 2009 21:01:04 +0530 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: <4AAB8E46.1080305@perathoner.de> References: <4AAB6E4B.4000004@perathoner.de> <4AAB8E46.1080305@perathoner.de> Message-ID: Most of the post processors in D.P depend on Guiguts for post processing. More than 80% of the texts have been produced by using Guiguts. But for the availability of the Guiguts program many of the post processors would have never ventured to post process. The Guiguts program has been written for the specific purpose of post processing of DP books. It is well supported with additional programs like Gutcheck and Jeebies. Guiguts generates the html from the text automatically. Guiguts has been written taking into account the DP process. Most post processors in DP are not technical people. Again the question is what do the users want? I am talking about people who download books from PG and not producers of other formats. Most of the users download text files. Just to quote an example the text only format of Alice in Wonderland is downloaded more often than the illustrated html version. The text version is the LCM. Do we have statistics about downloading of html and text versions? I am sure most users download the text version. So even if we have put in additional effort to produce a text version it is justified. Do we have any feedback from the actual users? Letters from users who submit detailed Errata shows that the text files are being used for teaching school children in the remote areas of U.S. These are the people who make the effort worthwhile. May be it also benefits people who are still on Dial Up. Plain text can be read in any computer. HTML? With all the quirks of IE6 and other browsers it is not easy to produce html which will render perfectly in all the browsers. The earlier discussion was about whether a ASCII text is necessary? DP does produce TEI text. But there are very few post processors who can do TEI format. The main reason is the absence of a software like Guiguts. On Sat, Sep 12, 2009 at 5:34 PM, Marcello Perathoner wrote: > Sankar Viswanathan wrote: > > The final output from DP is a text. This is processed through Guiguts. >> Most of the Post Processors in DP use Guiguts for post processing. The >> html is generated from this text file. >> > > If this is true its all the more waste. 
> > If you output a text file from the OCR and later use a human to re-create > HTML this is more work than letting the OCR output the HTML directly. > > And all this crooked workflow is needed because PG requires a txt file for > hysterical reasons. > > No wonder Google is eating our lunch ... they know how to put software to > work instead of people. > > > So no additional work is involved in producing a text file. >> > > Nice sophism. Additional work is required to produce the HTML file. So > what? > > > Again there is no additional work in White Washing because of the text >> file. >> > > I don't believe you. > > Working 2 files (3, maybe 4) IS more work than working one file. Even if > you just open the file to see if it is the right one, its work. > > > > > -- > Marcello Perathoner > webmaster at gutenberg.org > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -- Sankar Service to Humanity is Service to God -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Sat Sep 12 10:30:47 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 12 Sep 2009 19:30:47 +0200 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: References: <4AAB6E4B.4000004@perathoner.de> <4AAB8E46.1080305@perathoner.de> Message-ID: <4AABDAC7.7080609@perathoner.de> Sankar Viswanathan wrote: > Most of the post processors in D.P depend on Guiguts for post processing. More > than 80% of the texts have been produced by using Guiguts. But for the > availability of the Guiguts program many of the post processors would have never > ventured to post process. That's the more water to my mill. You need a custom program to proof the txt file while any old editor can proof html. > The Guiguts program has been written for the specific purpose of post processing > of DP books. It is well supported with additional programs like Gutcheck and > Jeebies. Bad enough that a special program had to be written while many free editors excel at doing html. > Guiguts generates the html from the text automatically. Guiguts has been written > taking into account the DP process. Yeah, for suitably small values of `HTML?. I installed guiguts and downloaded Hamlet #1524. Then I pushed the 'Autogenerate HTML' button in guiguts. This is part of what I got:

Ham. To be, or not to be,—that is the question:— Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune Or to take arms against a sea of troubles, And by opposing end them?—To die,—to sleep,— No more; and by a sleep to say we end The heartache, and the thousand natural shocks That flesh is heir to,—'tis a consummation Devoutly to be wish'd. To die,—to sleep;— To sleep! perchance to dream:—ay, there's the rub; For in that sleep of death what dreams may come, When we have shuffled off this mortal coil, Must give us pause: there's the respect That makes calamity of so long life; For who would bear the whips and scorns of time, The oppressor's wrong, the proud man's contumely, The pangs of despis'd love, the law's delay, The insolence of office, and the spurns That patient merit of the unworthy takes, When he himself might his quietus make With a bare bodkin? who would these fardels bear, To grunt and sweat under a weary life, But that the dread of something after death,— The undiscover'd country, from whose bourn No traveller returns,—puzzles the will, And makes us rather bear those ills we have Than fly to others that we know not of? Thus conscience does make cowards of us all; And thus the native hue of resolution Is sicklied o'er with the pale cast of thought; And enterprises of great pith and moment, With this regard, their currents turn awry, And lose the name of action.—Soft you now! The fair Ophelia!—Nymph, in thy orisons Be all my sins remember'd.

guiguts takes its place in the long file of products and services who tried to make something of PG plain text and failed. Mind you, I'm not saying that guiguts is a bad program, I'm saying that it is impossible to recover the formatting once a text has been dumbed down to PG plain text. > Again the question is what do the users want? Users want as many formats as possible to choose from. > So even if we have put in additional effort to produce a text version it is > justified. Not so. We can do that automatically with lynx --dump. lynx is free, so anybody can do that. If you produce a `smart? version you can dumb it down with software. If you produce a `dumb? version, it is impossible to smart it up again with software. > Do we have any feedback from the actual users? Letters from users who submit > detailed Errata shows that the text files are being used for teaching school > children in the remote areas of U.S. These are the people who make the effort > worthwhile. May be it also benefits people who are still on Dial Up. Why do *those* people make the effort worthwile? Are you a bit prejudiced against better-off people? "War and Peace" is 1.18M in HTML and 1.16M in TXT. How can that benefit people on dial-up? > Plain text can be read in any computer. HTML? With all the quirks of IE6 and > other browsers it is not easy to produce html which will render perfectly in all > the browsers. It is very easy indeed. Stick to the basic tags and even plucker on a cell phone will render perfectly. -- Marcello Perathoner webmaster at gutenberg.org From azkar0 at gmail.com Sat Sep 12 10:56:28 2009 From: azkar0 at gmail.com (Scott Olson) Date: Sat, 12 Sep 2009 11:56:28 -0600 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: <4AABDAC7.7080609@perathoner.de> References: <4AAB6E4B.4000004@perathoner.de> <4AAB8E46.1080305@perathoner.de> <4AABDAC7.7080609@perathoner.de> Message-ID: <2362473e0909121056va506760t46236c031dea0a43@mail.gmail.com> On Sat, Sep 12, 2009 at 11:30 AM, Marcello Perathoner < marcello at perathoner.de> wrote: > Sankar Viswanathan wrote: > > Most of the post processors in D.P depend on Guiguts for post processing. >> More than 80% of the texts have been produced by using Guiguts. But for the >> availability of the Guiguts program many of the post processors would have >> never ventured to post process. >> > > That's the more water to my mill. You need a custom program to proof the > txt file while any old editor can proof html. > Guiguts processes the output of the DP proofing process. That output is neither raw text, nor raw HTML. It's a mix of different markups that struggles to find a balance between unambiguous output, and ease of the actual proofing process. The format is one that's relatively easy to pick-up, as unobtrusive as possible to the proofing process, and one that can be fairly automatically converted to both text and html by the tools that have been designed. > I installed guiguts and downloaded Hamlet #1524. Then I pushed the > 'Autogenerate HTML' button in guiguts. This is part of what I got: > >

Ham. > To be, or not to be,—that is the question:— > Whether 'tis nobler in the mind to suffer > The slings and arrows of outrageous fortune > Or to take arms against a sea of troubles, > And by opposing end them?—To die,—to sleep,— > No more; and by a sleep to say we end > The heartache, and the thousand natural shocks > That flesh is heir to,—'tis a consummation > Devoutly to be wish'd. To die,—to sleep;— > To sleep! perchance to dream:—ay, there's the rub; > For in that sleep of death what dreams may come, > When we have shuffled off this mortal coil, > Must give us pause: there's the respect > That makes calamity of so long life; > For who would bear the whips and scorns of time, > The oppressor's wrong, the proud man's contumely, > The pangs of despis'd love, the law's delay, > The insolence of office, and the spurns > That patient merit of the unworthy takes, > When he himself might his quietus make > With a bare bodkin? who would these fardels bear, > To grunt and sweat under a weary life, > But that the dread of something after death,— > The undiscover'd country, from whose bourn > No traveller returns,—puzzles the will, > And makes us rather bear those ills we have > Than fly to others that we know not of? > Thus conscience does make cowards of us all; > And thus the native hue of resolution > Is sicklied o'er with the pale cast of thought; > And enterprises of great pith and moment, > With this regard, their currents turn awry, > And lose the name of action.—Soft you now! > The fair Ophelia!—Nymph, in thy orisons > Be all my sins remember'd.

> Guiguts wasn't designed to convert existing texts. It's purpose is to help a DP PPer turn the output of the DP rounds into the final product seen on PG. In this case, the DP text for a piece of poetry would have had the poetry wrapped in poetry markers, signifying to Guiguts that it had to treat the block of text as non-wrappable poetry, and not just a straight paragraph of prose. > > -- > Marcello Perathoner > webmaster at gutenberg.org > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richfield at telkomsa.net Sat Sep 12 11:04:11 2009 From: richfield at telkomsa.net (Jon Richfield) Date: Sat, 12 Sep 2009 20:04:11 +0200 Subject: [gutvol-d] TXT and all that. Message-ID: <4AABE29B.3030701@telkomsa.net> Folks, I actually am up to the back teeth with the about it and abouting. A lot of intelligent people (sincerely meant, no sarcasm) saying mainly intelligent things about material problems and objectives and yet... SO I was going to shut up. ME!!! I'll try to make up for it by being short and to beside the point. As a long-time hands-on, low-level, DP support and development man, I understand the importance of TXT files. Really I do. Anyway, as a reader with a PC, I can represent the TXT files fairly comfortably, even for reading big books, where Dick and Jane meet in Whore and Piece. As a user of PCs that are mere years old, I understand the importance of more convenient formats. Trust me, I do. As either I shall not argue the point. If you seek the reason why, circumspice! Now, I am no major contributor to PG, but I have contributed some quick-and-dirty digitisations, using modest facilities. I use M$oft under protest, because nowadays as a user (read: bottom-feeder!) I don't have the time to learn decent stuff and instead I put up with all the triviality and inelegance. Upshot: * Scan the book (Actually, nowadays my scanner lies idle. Mainly I use my digital camera: faster, often better, far more flexible, more portable; (I can use it in libraries etc) and less harmful to the books too. That is nice!) * After having for some years used the perfectly useful crippleware version that came in the cereal box with my scanner I bought a decent omniscan on a $100 on-line special. That too is nice. *Feed the output into Word. (I actually have certain reservations about this, but note that Word has certain useful aspects: It deals fairly nicely with TXT AND with HTM. And it is programmable. I don't have to work with Courier or FTM Arial all the time.Useful font, but neither restful, nor comfortable for speed reading.) In fact, though I have not yet done anything with it, I am fairly sure that I have enough facilities at hand to produce PDF as well if I choose. But here the same prejudice that makes me appreciate TXT kicks in: If I lose my software or get stuck with moderately damaged files, I can easily edit HTM to make it readable, but I'll be blowed if I bother to learn more opaque formats. To be sure, the PG TXT format is, whatever its merits, not nice, but if that is what they want... *In Word, format the whole caboodle fairly nicely, illustrations and all. Use all the nice features, including programmability, for editing etc. BB doesn't like the result of course, and i can see why, but I have only one life, and not much more of that, so...Nice chapter headings, page numbering etc. 
Also tables of contents, whatever is free or nearly. It helps with the editing anyway, so why not? *What? PG doesn't like DOC? Tsk Tsk! So I do a conversion to "filtered" HTML. Hard work that. Takes dozens of keystrokes. Well, more than a dozen anyway. *Ahaaah! Gotcha! PG doesn't like Word HTM either! How do you like that my buck??? Big deal! Someone steered me in the direction of HTMLKit! Now THERE is a useful product. May the commercial users make the company stinking rich! They deserve it big time. HTMLKit convert a Word file so easily to clean HTML that I don't bother to keep backups except of the DOC. There I have a working HTML with figures, tables, the full catastrophe. *But what about TXT? Here is where word comes in useful. again. I steadfastly resist any hyphenation except where words are too long for the line, or where a word is hyphenated. This, apart from other virtues, make it trivial to break lines with the help of a macro or two. (Actually, I have used other convenient TXT processors to break lines, but that is a matter of ad hoc convenience). *Then there are the GUTCHKs and so on, bless their writers' hearts. Actually I use them before the HTML production to trap errors that slip through other processors. By the time I have done with the HTML, I usually have finished with the TXT as well. Pictures? No problem. If anyone wants them they can get them from the HTM version. *Seems a long way round? Maybe. But I already had most of the tools for other purposes, and knew how to use them. Apart from the necessary basic effort all it takes is largely automated document production and conversion. Two steps for two files when you come down to it. *I realise that BB and a few others have better ways of doing it, but every time I get tempted to sniff down those alleys, I go and lie down for six seconds till the feeling passes. I'll re-evaluate new options as soon as everyone else agrees on all of them. By that time books and computers both will be passe. Cheers, Jon From marcello at perathoner.de Sat Sep 12 11:13:24 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 12 Sep 2009 20:13:24 +0200 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: <2362473e0909121056va506760t46236c031dea0a43@mail.gmail.com> References: <4AAB6E4B.4000004@perathoner.de> <4AAB8E46.1080305@perathoner.de> <4AABDAC7.7080609@perathoner.de> <2362473e0909121056va506760t46236c031dea0a43@mail.gmail.com> Message-ID: <4AABE4C4.4050704@perathoner.de> Scott Olson wrote: > Guiguts wasn't designed to convert existing texts. It's purpose is to help a DP > PPer turn the output of the DP rounds into the final product seen on PG. In this > case, the DP text for a piece of poetry would have had the poetry wrapped in > poetry markers, signifying to Guiguts that it had to treat the block of text as > non-wrappable poetry, and not just a straight paragraph of prose. I see. I was told the output of DP was text and the html generated from it. Now I gather DP uses some sort of proprietary internal markup and can produce HTML without having to produce TXT? Am I right? -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Sat Sep 12 12:28:04 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 12 Sep 2009 15:28:04 EDT Subject: [gutvol-d] Re: TXT and all that. Message-ID: oh geez, how could marcello make it any more clear that he doesn't know jack shit about distributed proofreaders? 
*** jon said: > In Word, format the whole caboodle fairly nicely, illustrations > and all. Use all the nice features, including programmability,?for > editing etc.? BB doesn't like the result of course, and i can see why hey, wait, don't put words into my mouth, jon. lots of times i would like a .doc file very much. it'd be much more useful than an .html file or a butchered .txt version, that's for sure... > Nice chapter headings, page numbering etc. Also tables of contents, > whatever is free or nearly. It helps with the editing anyway, so why not? indeed. but then, of course, if you used one of my editing tools instead, you'd find that you get all of those things "free" with it, as well... > So I do a conversion to "filtered" HTML. Hard work that.? > Takes dozens of keystrokes. Well, more than a dozen anyway. the conversion from .zml to .html takes just one button-click. > Ahaaah! Gotcha! PG doesn't like Word HTM either! the .html from the .zml conversion is just fine, according to p.g. > HTMLKit convert a Word file so easily to clean HTML that > I don't bother to keep backups except of the DOC. sounds good. except, of course, that your .html files differ in their makeup from other post-processor's .html files, so -- down the line -- it will be absolutely impossible for someone to understand the various ripples underlying all of your different .html variants, which means they won't be able to _maintain_ those .html files. so instead, the person(s) maintaining the library will turn to the .txt versions, and do the little bit of work necessary so that those .txt versions can serve as the master from which .html is created. which is what you should have done in the first place... this all means you've all basically created something temporary, instead of something that is able to be maintained a long time... > I realise that BB and a few others have better ways of doing it, > but every time I get tempted to sniff down those alleys, I go > and lie down for six seconds till the feeling passes. i do have a better way. but if you want to keep doing it in your clumsy way, it's ok. that's the way most people do most things, and the world didn't fall apart because of it. (the world fell apart because of greedy bankers.) just so long as you realize that your work on the .html files will be thrown out down the line, and you are ok with that, everything's fine. thanks for your generosity in volunteering your time and energy to the cause of digitizing books. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sat Sep 12 12:34:33 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 12 Sep 2009 15:34:33 EDT Subject: [gutvol-d] one more reminder about those .txt files Message-ID: one more reminder that the iphone app "eucalyptus" creates beautiful books by using the .txt files from p.g. all you yahoos who continue to say that it can't be done have been proven wrong by a person who went and did it. -bowerbird p.s. the programmer has even done a pretty good job of detecting the places where lines should not be rewrapped, like in tables, address blocks, signature blocks, and so on. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From cloos at jhcloos.com Sun Sep 13 08:49:38 2009 From: cloos at jhcloos.com (James Cloos) Date: Sun, 13 Sep 2009 11:49:38 -0400 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AAB9F71.2030701@perathoner.de> (Marcello Perathoner's message of "Sat, 12 Sep 2009 15:17:37 +0200") References: <4AAB9F71.2030701@perathoner.de> Message-ID: In case anyone really wants to do it right, what PG needs is to have each book (and other documents) marked up semanticly. Of all of the exsting SGML/XML applications, TEI seems best for what PG is doing. Combined with SVG and X3D for graphics, xcite for any citations, etc. The best way to mark up existing PG texts may be to put the docuemnts in a wiki alongside scans and encourage the public to add the markup. Wiki-style markup seems to be easier to comprehend for most of the public. (And with reason.) In this model, incidently, each work could be served as a single file, complete with images and the like included inline. And the plain text version can be readily extracted using a stylesheet. TEI is at: http://www.tei-c.org/ -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6 From cloos at jhcloos.com Sun Sep 13 08:55:51 2009 From: cloos at jhcloos.com (James Cloos) Date: Sun, 13 Sep 2009 11:55:51 -0400 Subject: [gutvol-d] Line Art Message-ID: I see that the HTML versions of a number of PG works include images of art from the original books. Are higher-resolution scans of those images available anywhere? I'd like to experiment with automated vectorization of scans of line art, and providing SVG format vector files of the line art back to PG would make the effort doubly useful. (Encapsulated PS and PDF files also could be made available, if desired.) -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6 From sly at victoria.tc.ca Sun Sep 13 11:22:25 2009 From: sly at victoria.tc.ca (Andrew Sly) Date: Sun, 13 Sep 2009 11:22:25 -0700 (PDT) Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAB9F71.2030701@perathoner.de> Message-ID: Yes, TEI has been discussed in this group a number of times before. And there are some contributors using it. When I go to gutenberg.org and do an advanced search, looking for TEI as filetype, I find 210 results. One volunteer's guideline for using TEI can be found at: http://pgtei.pglaf.org/marcello/0.4/doc/20000-h.html In short, it is there, and is being used, but not by many people. Would you like to help contribute more TEI texts to the project? Thanks, Andrew On Sun, 13 Sep 2009, James Cloos wrote: > In case anyone really wants to do it right, what PG needs is to have > each book (and other documents) marked up semanticly. > > Of all of the exsting SGML/XML applications, TEI seems best for what > PG is doing. Combined with SVG and X3D for graphics, xcite for any > citations, etc. > From traverso at posso.dm.unipi.it Sun Sep 13 11:49:18 2009 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Sun, 13 Sep 2009 20:49:18 +0200 (CEST) Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: (message from Andrew Sly on Sun, 13 Sep 2009 11:22:25 -0700 (PDT)) References: <4AAB9F71.2030701@perathoner.de> Message-ID: <20090913184918.1EC95100F8@cardano.dm.unipi.it> The problem with PGTEI (the PG dialect of TEI for which PG has automatic conversion to several end-user formats) is that the final output is considered ugly by many contributors. A second problem is that there is no automatic conversion tool to get (almost) working PGTEI from DP internal markup. 
I believe that both problems could be solved with little effort. Carlo Traverso From Bowerbird at aol.com Sun Sep 13 12:32:57 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 13 Sep 2009 15:32:57 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: jim said: > In case anyone really wants to do it right, > what PG needs is to have each book > (and other documents) marked up semanticly. > > Of all of the exsting SGML/XML applications, > TEI seems best for what PG is doing.? jim, first of all, you're wrong. and second of all, you're about 5-8 years late for this conversation. i think the archives are still available, and will give you a good idea of this long-raging debate. but thanks for giving us all a blast from the past. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From no.la at web.de Sun Sep 13 13:36:19 2009 From: no.la at web.de (Norbert Langkau) Date: Sun, 13 Sep 2009 22:36:19 +0200 Subject: [gutvol-d] Re: Line Art Message-ID: <1258367676@web.de> Hi James, please have a look at this book: http://www.gutenberg.org/etext/23787 The "base drectory" directs you to http://www.gutenberg.org/files/23787/ where you find http://www.gutenberg.org/files/23787/23787-page-images/ These are 300 dpi, as far as I recall, but I could provide some 600 dpi's if need be. I'm very curious on how the outcome will look like. Best regards - Norbert > -----Urspr?ngliche Nachricht----- > Von: "James Cloos" > Gesendet: 13.09.09 17:56:42 > An: gutvol-d at lists.pglaf.org > Betreff: [gutvol-d] Line Art > I see that the HTML versions of a number of PG works include images > of art from the original books. > > Are higher-resolution scans of those images available anywhere? > > I'd like to experiment with automated vectorization of scans of line > art, and providing SVG format vector files of the line art back to PG > would make the effort doubly useful. (Encapsulated PS and PDF files > also could be made available, if desired.) > > -JimC > -- > James Cloos OpenPGP: 1024D/ED7DAEA6 > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > ______________________________________________________ GRATIS f?r alle WEB.DE-Nutzer: Die maxdome Movie-FLAT! Jetzt freischalten unter http://movieflat.web.de From jimad at msn.com Sun Sep 13 15:47:32 2009 From: jimad at msn.com (Jim Adcock) Date: Sun, 13 Sep 2009 15:47:32 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: Sigh. I don't know what the solution is, but for me as a content-provider it is heart-breaking to do my best to try to "do the job right" and then see the hard-won knowledge and effort I have put into "doing it right" thrown away BOTH by the txt and the html as implemented by PG. I'd love to see an input format that preserves the hard-won effort I put into content creation, AND which is NOT a "write once" format, such that future content producers can easily build on the efforts I have already put into creating a correct content creation, and NOT have to redo the work I have already done because BOTH txt and html as implemented by PG throw away work effort I have already done. 
Yes, it is possible for future content producers to go over the text front to back another three or four passes after I have done so already in order to try to "catch" again the errors that txt and html have re-introduced -- but why would anyone want that they should have to do so? What I would like to see as an input-submission format is something that: 1) Preserves the hard-won effort I have already put into content creation, such that a future volunteer can build on my work without having to "reverse engineer" those gratuitous errors currently being introduced by the current PG use of txt and html. 2) Works well-enough even with commonly available "bottom feeder" tools. [[Personally I get tired of claims of "magic bullet" tools and then I spend a day trying to get them to work on my computer and they don't even install and run correctly.]] 3) Does simple common tasks in a simple transparent way. 4) Isn't ugly or ungainly for simple common everyday tasks. 5) Can be -- and is in practice -- transformed from input format to a variety of end reader formats in an attractive manner which does not contain common uglinesses for common book situations.
From marcello at perathoner.de Sun Sep 13 16:11:00 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 14 Sep 2009 01:11:00 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: <4AAD7C04.5030806@perathoner.de> Jim Adcock wrote: > 1) Preserves the hard-won effort I have already put into content creation, > such that a future volunteer can build on my work without having to "reverse > engineer" those gratuitous errors currently being introduced by the current > PG use of txt and html. Please give some real-world examples. -- Marcello Perathoner webmaster at gutenberg.org
From tb at baechler.net Sun Sep 13 19:56:27 2009 From: tb at baechler.net (Tony Baechler) Date: Sun, 13 Sep 2009 19:56:27 -0700 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: References: Message-ID: <4AADB0DB.5000209@baechler.net> I'm not sure what you mean by this. Most screen readers will read underlines or periods as underlines or periods, so there is no emphasis on bold or italics. If you mean something else, please accept my apologies and elaborate further. I personally turn off all punctuation when reading because I don't want to hear the periods and such. In Windows and MS Word, it will tell you if something is formatted differently. In English Braille which is also 7-bit, there is usually an accent mark (the equivalent to "`" to the sighted) to show any accented letter and a similar underline (or underscore if you prefer) convention for other emphasis. On 9/11/2009 9:26 PM, James Adcock wrote: > > PS: Bit curious which blind reader handles _/the underscore > 'convention'/_ correctly -- I've not seen _/that/_ one! > >
From richfield at telkomsa.net Mon Sep 14 02:17:17 2009 From: richfield at telkomsa.net (Jon Richfield) Date: Mon, 14 Sep 2009 11:17:17 +0200 Subject: [gutvol-d] Re: Line Art In-Reply-To: References: Message-ID: <4AAE0A1D.9080306@telkomsa.net> Dunno really. When I scan a book I photograph the pages, usually at fairly low resolution for OCR. Any line art (or other) pictures I may photograph over again with more care and higher resolution. However, I don't know how relevant this is to your question.
My books are generally either technical or of historical interest and I have not yet had occasion to prepare one on art as such. Accordingly I am more interested in producing a functional representation than an artistically adequate one. Cheers Jon > I see that the HTML versions of a number of PG works include images > of art from the original books. > > Are higher-resolution scans of those images available anywhere? > > I'd like to experiment with automated vectorization of scans of line > art, and providing SVG format vector files of the line art back to PG > would make the effort doubly useful. (Encapsulated PS and PDF files > also could be made available, if desired.) > > -JimC > From jimad at msn.com Mon Sep 14 08:59:33 2009 From: jimad at msn.com (Jim Adcock) Date: Mon, 14 Sep 2009 08:59:33 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AAD7C04.5030806@perathoner.de> References: <4AAD7C04.5030806@perathoner.de> Message-ID: >> 1) Preserves the hard-won effort I have already put into content creation, >> such that a future volunteer can build on my work without having to "reverse >> engineer" those gratuitous errors currently being introduced by the current >> PG use of txt and html. > >Please give some real-world examples. OK. My point being that IF PG were to accept a "proper" book INPUT encoding format that preserves the hard-won knowledge of the original encoding volunteer, then there would be no need for a future volunteer to have to completely scan that encoding against the original book scans in order to make another pass looking for errors, etc. So what all is "wrong" with TXT and HTML in this regards as stored in the PG databases? Both formats throw away the original volunteers' knowledge about the common parts of books: TOC, author info, pub info, copyright pages, index, chapters, etc. Yes one can code this information in HTML but there is no unambiguous way to do so which means that PG HTML encodings all take different paths, as one rapidly discovers if one tries to automagically convert PG HTML into other reflow file formats. You could follow common h1, h2, h3 settings by convention -- if PG were to establish and require such -- but then you end up with really ugly rendered HTML on common displays. You can overcome this with style sheets -- but then you are defeating many tools which automagically convert HTML into a variety of other reflow file formats for the various e-readers. Both formats as stored by PG gratuitous throw away hard-won line-by-line alignments between scan text and hand-scanno corrected text. These alignments are needed if a future volunteer wanted to make another pass at "fixing" errors in the text, for example by running through DP again, or running it against a future automagic tool comparing a new scan to the PG text. I submit my HTML to PG WITH the original line-by-line alignments -- because it doesn't in any way hurt the HTML and allows a future volunteer to make another pass on my work -- but then PG insists on throwing this information away anyway before posting their HTML files. Both formats throw away page numbers and page breaks, which again are necessary to make another volunteer pass against the original scans, and also to make future passes against broken link info, etc. Also would be useful for some college courses, where you need page number refs, even if reading on a reflow reader device. 
I'm NOT suggesting that page numbers should typically be displayed in an OUTPUT reflow file format rendering, rather that this represents hard-won information that ought to be retained in a well-designed INPUT file format encoding. TXT files seem to me to almost always have some glyphs outside of the 8-bit char set. Unicode text files would at least overcome this limitation. HTML in theory doesn't have this limitation, but in practice I find in submitting "acceptable" HTML to PG, running it through their battery of acceptance tools, I find some glyphs I can't get through, so I end up punting and throwing away "correct" glyph information, dumbing down the representation of some glyphs. PG and DP *in practice* have a dumbed-down concept of punctuation, such that it's impossible to maintain and retain "original author's intent" as expressed in the printed work. For example, M-Dash is commonly found in three contexts: lead-in, lead-out, and connecting, similar to how ellipses are used at least in three different ways: ...lead in, lead out, and ... connecting. But in practice all one can get through PG and DP is connecting M-dash. Also consider all the [correct] variety of Unicode quotation marks which needlessly get reduced in PG and DP to only U+0022 or U+0027. In general PG has a dumbed-down concept of punctuation, that near is near enough, and is actively hostile to accurately encoding the punctuation as rendered in the original print document. Again, it is EASY to dumb down an INPUT file format, for example if you need to output to a 7-bit or even a 5-bit teletypewriter, if that is what you want. So why insist that the input file encoder get it wrong in the first place? It is easy to throw away information when going from an INPUT file encoding to an OUTPUT file rendering. It is VERY DIFFICULT to correctly fix introduced errors when going back from a reduced OUTPUT file rendering to a correctly encoded input file encoding. What I am imagining is some simple-to-use file encoding format where a volunteer can correctly and unambiguously code the common things and conventions one commonly finds in everyday books, such that another volunteer can pick up and make another pass on the book some years hence -- without having to reinvent nor rediscover work that the previous volunteer has already put into understanding and coding the book. Such an INPUT file encoding would have little or nothing to do with how the output will be displayed in an eventual OUTPUT file rendering. DP already has much of this distinction in their work flow. Unfortunately, their page-by-page conventions and simplifications, "dumbing down" for the sake of the multiple levels of volunteers, guarantee loss of information. Not to mention that they also throw away the correctly encoded INPUT file hard-won knowledge for more ambiguous OUTPUT file renderings in HTML and TXT. The end result is that both PG and DP end up being "write once" efforts that are hostile to future improvements by future volunteers -- instead of encouraging on-going efforts to improve what we've got. Which is also indicative of a general culture of quantity, not quality. PG pretends that part of why we do what we do is to protect and preserve books in perpetuity. This implies in exchange that information that is gratuitously thrown away during input file encoding [or directly in an output file rendering] is potentially lost for eternity. Why insist via policy that volunteer input file encoders must throw away this information?
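The kind of round trip described above can be illustrated concretely. What follows is only a sketch under assumptions of this edit, not an official PG or DP format: it borrows the standard TEI milestone elements pb (page break) and lb (line break) to keep the page- and line-level alignment with the scans, and shows that flowed plain text can still be generated from such a master mechanically. The sample fragment and the helper function are hypothetical.

    # Sketch only -- not an official PG/DP workflow. <pb/> and <lb/> are the
    # standard TEI milestone elements for page and line breaks; everything
    # else here (sample text, helper) is illustrative.
    import xml.etree.ElementTree as ET

    SAMPLE = """<div>
      <pb n="12"/>
      <p>It was the best of times,<lb/>
      it was the worst of times,<lb/>
      it was the age of wisdom.</p>
    </div>"""

    def plain_text(fragment):
        """Drop the markup and reflow the prose -- the easy, lossy direction."""
        root = ET.fromstring(fragment)
        # The milestone elements carry no text of their own, so itertext()
        # simply skips past them; only the transcribed words survive.
        return " ".join("".join(root.itertext()).split())

    print(plain_text(SAMPLE))
    # -> It was the best of times, it was the worst of times, it was the age of wisdom.

Going the other direction -- reconstructing the page and line boundaries from the flowed text -- is the part that cannot be automated, which is exactly the information a master format would be keeping.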
From jimad at msn.com Mon Sep 14 09:31:47 2009 From: jimad at msn.com (Jim Adcock) Date: Mon, 14 Sep 2009 09:31:47 -0700 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: <4AADB0DB.5000209@baechler.net> References: <4AADB0DB.5000209@baechler.net> Message-ID: Yes, you understand me the way I understand the blind readers I know of, namely that they will read _an ironic reference_ as "underscore an ironic reference underscore" not read with prosodic emphasis "an ironic reference". Thus when you turn off punctuation you also lose any representation of prosodic emphasis that the author originally encoding in the original printed text. This is not a small deal, IMHO. There are some books such as "The Dove" by Henry James which are virtually impossible to even scan without maintaining the author's original proper representation of prosodic emphasis. >I'm not sure what you mean by this. Most screen readers will read underlines or periods as underlines or periods, From marcello at perathoner.de Mon Sep 14 10:47:08 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 14 Sep 2009 19:47:08 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> Message-ID: <4AAE819C.1040108@perathoner.de> Jim Adcock wrote: > OK. My point being that IF PG were to accept a "proper" book INPUT encoding > format that preserves the hard-won knowledge of the original encoding > volunteer, then there would be no need for a future volunteer to have to > completely scan that encoding against the original book scans in order to > make another pass looking for errors, etc. There's a misconception here. PG *does* allow you to post additional file formats *along* with TXT and HTML. TEI comes to mind as format perfectly suitable to preserve a lot that HTML cannot. The reason that there isn't a TEI file posted along with *every* ebook is that most PPers at DP don't care to produce one. > Both formats throw away the original volunteers' knowledge about the common > parts of books: TOC, author info, pub info, copyright pages, index, > chapters, etc. TEI has elements for all these cases. > TXT files seem to me to almost always have some glyphs outside of the 8-bit > char set. Unicode text files would at least overcome this limitation. I don't see any problem here: Produce utf-8 files. The whitewashers will create some work for themselves by converting the utf-8 to all sorts of embarrassing encodings and then waste more time at the helpdesk to explain to incredulous users what `encodings? are, but that need not be your problem. -- Marcello Perathoner webmaster at gutenberg.org From prosfilaes at gmail.com Mon Sep 14 11:40:18 2009 From: prosfilaes at gmail.com (David Starner) Date: Mon, 14 Sep 2009 14:40:18 -0400 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AAE819C.1040108@perathoner.de> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> Message-ID: <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> On Mon, Sep 14, 2009 at 1:47 PM, Marcello Perathoner wrote: > PG *does* allow you to post additional file formats *along* with TXT and > HTML. TEI comes to mind as format perfectly suitable to preserve a lot that > HTML cannot. > > The reason that there isn't a TEI file posted along with *every* ebook is > that most PPers at DP don't care to produce one. 
And the reason for that is not only is it a lot more work than an HTML edition, unsupported by any sort of tools, it's worthless to the end user, as apparently no one at PG can get decent output from it. -- Kie ekzistas vivo, ekzistas espero. From hart at pobox.com Mon Sep 14 14:22:02 2009 From: hart at pobox.com (Michael S. Hart) Date: Mon, 14 Sep 2009 14:22:02 -0700 (PDT) Subject: [gutvol-d] PG French eBook #1500 Message-ID: Right now we are looking at Voltaire, de Toqueville's "Democracy," and a few others. 20 more and we are at 1500. Please take a look for various copies of "Democracy" and anything else you think we might be able to use, and let me know. Thanks!!! Michael S. Hart Founder Project Gutenberg From jimad at msn.com Mon Sep 14 14:34:00 2009 From: jimad at msn.com (Jim Adcock) Date: Mon, 14 Sep 2009 14:34:00 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AAE819C.1040108@perathoner.de> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> Message-ID: >TEI comes to mind as format perfectly suitable to preserve a lot that HTML cannot. Um, this standard is 1350 pages long. Tell me again why I should be reading it? I want to code books -- not the Sistine Chapel. >I don't see any problem here: Produce utf-8 files. But that would still leave all the other problems with txt files. And the reason we are required to produce txt is to support those with teletypewriters. Rhetorically speaking why not just produce as bad txt files as one can and still get away with it and hope that someday soon both Gut readers and Gut content produces will see the light and give txt up as long gone dead? From prosfilaes at gmail.com Mon Sep 14 15:01:05 2009 From: prosfilaes at gmail.com (David Starner) Date: Mon, 14 Sep 2009 18:01:05 -0400 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> Message-ID: <6d99d1fd0909141501p2eb26b30wc73591fcf5998f6b@mail.gmail.com> On Mon, Sep 14, 2009 at 5:34 PM, Jim Adcock wrote: >>I don't see any problem here: Produce utf-8 files. > > But that would still leave all the other problems with txt files. ?And the > reason we are required to produce txt is to support those with > teletypewriters. So? UTF-8 works just fine when viewed in a UTF-8 xterm, and can be translated on the fly by many programs. -- Kie ekzistas vivo, ekzistas espero. From ajhaines at shaw.ca Mon Sep 14 15:49:42 2009 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Mon, 14 Sep 2009 15:49:42 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> Message-ID: <44414895897048CF95F99FE12CC881FD@alp2400> ----- Original Message ----- From: "Jim Adcock" To: "'Project Gutenberg Volunteer Discussion'" Sent: Monday, September 14, 2009 2:34 PM Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT > >TEI comes to mind as format perfectly suitable to preserve a lot > that HTML cannot. > > Um, this standard is 1350 pages long. Tell me again why I should be > reading > it? I want to code books -- not the Sistine Chapel. > Check out http://pgtei.pglaf.org/. Marcello's PG-TEI manual is <200 pages. There's also TEI-Lite at http://www.tei-c.org/Guidelines/Customization/Lite/. >>I don't see any problem here: Produce utf-8 files. > > But that would still leave all the other problems with txt files. 
And the > reason we are required to produce txt is to support those with > teletypewriters. Rhetorically speaking why not just produce as bad txt > files as one can and still get away with it and hope that someday soon > both > Gut readers and Gut content produces will see the light and give txt up as > long gone dead? > Text will never be dead. It's portable to all platforms, doesn't need a browser or a PDF-like reader, only the most basic editor. In modern terms, it's the stem cell of ebook files--all else can be generated from it. Maybe with greater or lesser prettiness, but as long as you get the words, who cares what the quote marks look like? > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From jimad at msn.com Mon Sep 14 20:31:08 2009 From: jimad at msn.com (Jim Adcock) Date: Mon, 14 Sep 2009 20:31:08 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <44414895897048CF95F99FE12CC881FD@alp2400> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> Message-ID: >....but as long as you get the words, who cares what the quote marks look like? There are a lot of texts where you cannot "get" the words from just the words. There are also texts with quotes within quotes, where if you don't care what the quote marks look like _you cannot read it!_ Certainly a text like Tristram Shandy demonstrates there are books which are NOT just about the words -- where rather, the artistry of representing word on paper -- including careful choice of fonts, puncs, etc. is a central part of the artistry -- as one can easily see by comparing a bad publication of this work to a good one! The good publications represent the work of the artist, the bad one's clearly do not. And a txt representation would be just so many chicken scratchings in the mud. I'm sure there are many here who would say "but I don't like Tristram Shandy" -- and that would be my point. By bringing a prejudice to the table that only texts worth representing in txt are worth representing, you prejudice what books PG is allowed to preserve, and you censor the choice of artists that others are permitted to preserve. You represent some artists, and consign the others to oblivion. From prosfilaes at gmail.com Mon Sep 14 20:47:15 2009 From: prosfilaes at gmail.com (David Starner) Date: Mon, 14 Sep 2009 23:47:15 -0400 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> Message-ID: <6d99d1fd0909142047u5a192b4cv1220494663969e8a@mail.gmail.com> On Mon, Sep 14, 2009 at 11:31 PM, Jim Adcock wrote: > Certainly a text like Tristram Shandy demonstrates there are books which are > NOT just about the words -- where rather, the artistry of representing word > on paper -- including careful choice of fonts, puncs, etc. is a central part > of the artistry -- as one can easily see by comparing a bad publication of > this work to a good one! Those sculptors who choose to work in ice are rarely remembered well by later ages. Sculptors who work in iron and bronze can easily be remembered for several millennia. The choice is the artist's. -- Kie ekzistas vivo, ekzistas espero. 
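One small, concrete illustration of why the quote-mark question is not purely cosmetic: folding typographic punctuation down to ASCII is a trivial, mechanical operation, but it is a one-way trip, because several distinct characters collapse onto the same ASCII stand-in. The snippet below is only a sketch with made-up sample text.

    # Sketch: the downgrade is easy; the upgrade is not. Sample text is invented.
    DOWN = {
        "\u2018": "'", "\u2019": "'",    # curly single quotes / apostrophe
        "\u201c": '"', "\u201d": '"',    # curly double quotes
        "\u2014": "--",                  # em dash
    }

    def to_ascii(s):
        for fancy, plain in DOWN.items():
            s = s.replace(fancy, plain)
        return s

    original = "\u201cIt\u2019s \u2018rather\u2019 late,\u201d she said\u2014again."
    print(to_ascii(original))
    # -> "It's 'rather' late," she said--again.
    # Going back is a different problem: the apostrophe in "It's" and the
    # quotes around 'rather' are now the same ASCII characters, and only
    # context (or the page image) can tell them apart.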
From ajhaines at shaw.ca Mon Sep 14 22:20:44 2009 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Mon, 14 Sep 2009 22:20:44 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> Message-ID: <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> ----- Original Message ----- From: "Jim Adcock" To: "'Project Gutenberg Volunteer Discussion'" Sent: Monday, September 14, 2009 8:31 PM Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT > >....but as long as you get the words, who cares what the quote marks look > like? > > There are a lot of texts where you cannot "get" the words from just the > words. There are also texts with quotes within quotes, where if you don't > care what the quote marks look like _you cannot read it!_ > I think I, and any other followers of this thread, will need an example of "not getting the words from the words". I've seen any number of instances of nested quotes, mostly nested doublequotes, lots of triple-nested double-single-double quotes, and some triple-nested single-double-single quotes (mostly in British-published books) and I have yet to encounter any that I couldn't read, either in the original source or when they've been etexted. > Certainly a text like Tristram Shandy demonstrates there are books which > are > NOT just about the words -- where rather, the artistry of representing > word > on paper -- including careful choice of fonts, puncs, etc. is a central > part > of the artistry -- as one can easily see by comparing a bad publication of > this work to a good one! The good publications represent the work of the > artist, the bad one's clearly do not. And a txt representation would be > just so many chicken scratchings in the mud. > I've looked at PG's text and HTML version of Shandy, and several PDFed scansets in Internet Archive. Unless I'm missing something, they all look like standard prose to me. If you've got an edition as difficult to transcribe as you seem to indicate, and it's not in Internet Archive, you should scan it, and if you have no interest in producing it yourself, upload the zipped scanset via FTP to PG (I can give exact instructions to you privately). As long as it's clearable, it may be possible to arrange for it to go into PG's Preprints page where it'll be available as a project for someone. > I'm sure there are many here who would say "but I don't like Tristram > Shandy" -- and that would be my point. By bringing a prejudice to the > table > that only texts worth representing in txt are worth representing, you > prejudice what books PG is allowed to preserve, and you censor the choice > of > artists that others are permitted to preserve. You represent some > artists, > and consign the others to oblivion. > Personally, I'm book-agnostic--as long as it's in English, a book is a book is a book. I'm would assume that those who produce books for PG in other languages feel the same way about books in those languages. Distributed Proofreaders, at least once, has produced a book in a language none of its proofers understood (#27120). 
> > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d >
From marcello at perathoner.de Tue Sep 15 01:01:57 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 15 Sep 2009 10:01:57 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <44414895897048CF95F99FE12CC881FD@alp2400> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> Message-ID: <4AAF49F5.4090206@perathoner.de> Al Haines (shaw) wrote: > Text will never be dead. It's portable to all platforms, doesn't need a > browser or a PDF-like reader, only the most basic editor. It's not portable to cellphones. While every modern cellphone comes with a browser I have never seen one with an editor. -- Marcello Perathoner webmaster at gutenberg.org
From schultzk at uni-trier.de Tue Sep 15 01:58:07 2009 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Tue, 15 Sep 2009 10:58:07 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> Message-ID: Hi Everybody, I will step in here for a moment. As Bowerbird has mentioned, this discussion is as old as PG itself. The problems are: 1) Plain Vanilla Texts cannot reproduce books (they are not meant to) 2) PG does NOT have a comprehensive format for reproducing books. 3) PG has not evolved with modern computer technology. 4) Everybody wants their pet formats for reading. 5) PG does not have a consolidated following willing to build the resources needed to solve the above. There are many and various reasons for the above problems. Yes, there ARE and have been efforts to solve the above. Yet, none of these have fruited much or have been able to satisfy the needs of all its contributors or users. So what is needed: 1) A single modular and extensible format for encoding the books a) the structures in the book (text) need to be represented b) it does not presume a particular output format c) does not care about the size of files d) does not need to be easily readable 2) a parser for creating output formats a) use all information to create the best possible output for a particular format 3) an editor a) display the book b) allow for changes in the representation of the book c) must be modular and extensible 4) a parser for creating the representation of the book in the format from scans a) must be modular and extensible b) must be multi-pass c) flags possible conflicts with the format d) intelligent enough to do most markup by itself e) intelligent enough to correct common errors by itself 5) parsers for converting older formats a) all of 4) b) does not expect particular information c) allows for presets in order to save time and get a desirable representation. 6) a proofing workflow So what do we have. We need a format that is not based on an existing format, is modular and extensible. Either we start from scratch or use a generic format. SGML or XML come to mind. We can then put in what we want and need, have a well structured format, can extend it easily and it is modular. Plus, XML can handle all kinds of information and data. Yes, we have to reinvent the wheel for markup, but we want a representation that contains as much information as possible.
The question would be how much is needed. At least the markup will be a layout format. It should only take about a month to create such a format. The other parts will take a little longer. The important thing is everything has to be centered around the representation format and not the output. The output is handled by parsers. Whether a particular output format can handle or represent a particular feature need not be a concern of the PG internal representation. The developers of the output format can convert it to whatever they deem fittest. regards Keith.
From schultzk at uni-trier.de Tue Sep 15 02:09:05 2009 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Tue, 15 Sep 2009 11:09:05 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AAF49F5.4090206@perathoner.de> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <4AAF49F5.4090206@perathoner.de> Message-ID: <2BC18EEF-94F6-43C9-BB98-579E73E5298B@uni-trier.de> Hi, On 15.09.2009 at 10:01, Marcello Perathoner wrote: > Al Haines (shaw) wrote: > >> Text will never be dead. It's portable to all platforms, doesn't >> need a browser or a PDF-like reader, only the most basic editor. True text will never go away! Yet, the way it is represented will change. It also depends on how you define TEXT. > > > It's not portable to cellphones. Strange? I get text messages all the time ;-)) > > While every modern cellphone comes with a browser I have never seen > one with an editor. I have an editor on mine. iPhones and Blackberries have them. Of course they are not modern. ;-)) regards Keith.
From marcello at perathoner.de Tue Sep 15 02:10:42 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 15 Sep 2009 11:10:42 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> Message-ID: <4AAF5A12.8040505@perathoner.de> David Starner wrote: >> The reason that there isn't a TEI file posted along with *every* ebook is >> that most PPers at DP don't care to produce one. > > And the reason for that is not only is it a lot more work than an HTML > edition, unsupported by any sort of tools, it's worthless to the end > user, as apparently no one at PG can get decent output from it. That's a lot of misinformation in such a short paragraph. 1. More work ... Of course you have to learn TEI, as you had to learn HTML. No difference there. Once you have mastered it, it is actually a lot less work, because TEI was designed for text preservation, while HTML was designed to bring scientific papers online. It is also less work because from one master you get the HTML, the TXT and the PDF. It is also less work fixing errata, because you fix the master instead of having to fix 2 or 3 different files. 2. Unsupported by tools ... PG has an implementation of TEI. I know you don't like it because you haven't figured out how to produce pretty title pages. But you don't have to use that one, there are plenty of others. TEI is being used by many projects: http://www.tei-c.org/Activities/Projects/ and has a full suite of tools: http://wiki.tei-c.org/index.php/Category:Tools 3. Worthless to the end user ... TEI is a master format. Its use is in producing formats suitable for end-user consumption.
And if we don't equate end user == reader but try: end user == librarian or end user == linguistic researcher we find that TEI is many times as useful as HTML. 4. No decent output ... 'Decentness' is a matter of debate. At DP some PPers think it is essential to use every CSS feature at least once in every text, having pictures float right and left and text flowing around them and having illuminated dropcaps and printers' ornaments and page numbers all over the place. PGTEI cannot (yet) do that. I very much prefer a simple layout, with only essential pictures smack in the middle of the text flow at the point they logically belong. A formatting that is easily ported to all existing devices. PGTEI excels at this. Ironically, 'decent' DP output is already falling to pieces on ePub devices (not even to mention Mobipocket) because ePub does not support CSS position: absolute. -- Marcello Perathoner webmaster at gutenberg.org
From marcello at perathoner.de Tue Sep 15 02:20:08 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 15 Sep 2009 11:20:08 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> Message-ID: <4AAF5C48.1040201@perathoner.de> Keith J. Schultz wrote: > We need a format that is not based on an existing format, ... Why not? > ... but we want a representation that contains as much information as > possible. > It should only take about a month to create such a format. ROTFL -- Marcello Perathoner webmaster at gutenberg.org
From pterandon at gmail.com Tue Sep 15 04:02:16 2009 From: pterandon at gmail.com (Greg M. Johnson) Date: Tue, 15 Sep 2009 07:02:16 -0400 Subject: [gutvol-d] World's most heavily pirated books Message-ID: I was listening to some podcast (I have since forgotten exactly which one, but it was about the celebration of sci-fi culture) where they talked about "the world's most pirated books." At first one of the hosts went into a little tirade that *book* piracy was the purest form of evil -- worse than any other, I guess because the book publishers weren't as evil or something. The list did include a recent book about Photoshop, which is unfortunate. But the list also included *The Kama Sutra*, and a few other really old classics. Eventually it dawned on me, if not entirely to both hosts as well, that they were conflating "piracy" with "downloading by bit torrent." < s i g h >. -- Greg M. Johnson http://pterandon.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL:
From Bowerbird at aol.com Tue Sep 15 14:34:17 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Sep 2009 17:34:17 EDT Subject: [gutvol-d] z.m.l. can do what you want Message-ID: z.m.l. can do what you want. as soon as you're ready to act, and not just yak. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: From jimad at msn.com Tue Sep 15 16:18:57 2009 From: jimad at msn.com (James Adcock) Date: Tue, 15 Sep 2009 16:18:57 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> Message-ID: >I think I, and any other followers of this thread, will need an example of "not getting the words from the words". Okay, let's go over a number of simple examples: Consider Michael's thesis of the "goodness" of viewing PG texts on cellphones. Which is a "good" submission format for submitting a transcription of Shakespeare to be read on a cellphone, PG txt format, or HTML? Answer: Neither file format works worth a dang for specifying Shakespeare to be read on cellphones. Yet both file formats contain the lists of the words. -- Even 400 years ago authors understood the importance of formatting and printing decisions to represent the meanings of words -- artistic writings ARE NOT just lists of words -- even when those words are clearly intended to be spoken out loud. Here's a brief excerpt from Dove: "'Go?'" he wondered. "Go when, go where?" And another one: She particularly likes you. Yes, you can read these words and you will assign meanings to these words but you will not get the author's intent because the author understood that he needed to put additional information in the printing so that you can understand his intent. This is particularly important in the Henry James because what he is writing is deliberately ambiguous and confusing in the first place, so much so that he has to disambiguate in order to reduce the degree of ambiguity in what he is writing -- while still deliberately leaving the reader dazed and confused -- but not so confused as to think (incorrectly) that they understand what is going on. I guess I can put some txt representation of Tristram Shandy here, but what would be the point? He's gone! said my uncle Toby Where? Who? cried my father My nephew, said my uncle Toby What, without leave, without money, without governor? cried my father in amazement No he is dead, my dear brother, quoth my uncle Toby Without being ill? cried my father again I dare say not, said my uncle Toby, in a low voice, and fetching a deep sigh from the bottom of his heart; he has been ill enough, poor lad! I'll answer for him, for he is dead. Yes, once again, you can read the words and you will assign meaning to them -- but not the meaning intended by the author, because the txt is missing information that the author found important to include so that you can understand his meaning -- to the extent that he wanted you to understand his meaning which again was partial in the first place. I'm not saying that there is no place in the world for txt -- as archy demonstrated clearly back in 1916: expression is the need of my soul And you can read this entire email and still come back and complain that you don't understand what I am talking about and in making this complaint you once again make my point for me: The reason that you don't understand what I am talking about is that I am writing this email using txt and the authors given as examples above were writing in a style requiring representation richer than mere PG txt. Go and find the author's original representations and read them there because PG txt simply doesn't cut it to represent their work. 
Read what they wrote and ask yourself what it takes to actually implement the author's intent, either automagically, or even semiautomagically, on a variety of differing reader devices -- including, but not limited to -- teletypes and their software emulators [which is essentially what txt devices are, including this email system and notepad, etc PS: If you can read this email at all please note that it's because I *didn't* write it following PG txt conventions. From jimad at msn.com Tue Sep 15 16:38:54 2009 From: jimad at msn.com (James Adcock) Date: Tue, 15 Sep 2009 16:38:54 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> Message-ID: ...So what is needed... Yes, except I don't think it's as bad as you make it out to be. TEI and/or PG-TEI could be a good intermediate formal file format. DP markup [and conventions] could be a good preliminary editing markup format. Editing doesn't necessarily need to be WYSIWYG. Input formatted files don't have to be perfect since they are living documents, as opposed to current "write once" output formatted files. Conversion from an input format file to output rendering formats such as txt or html or the various other reflow formats doesn't have to be perfect -- as long as the input format to output format rendering software does more work than the current tools for the job -- which basically is none. You probably have to store CSS or other style choices representation to help reconstruct how the original volunteers chose to render the input file format to the output rendering file format. [Where I am assuming here that html is simply being used as an output rendering file format, so that we don't have to argue anymore about the "correct" semantic use of html -- we would say that the semantics are being represented in the input file format, not in the html] Again, this is all trying to address at least three problems: 1) How do you represent the author's intention without deliberately throwing away information? 2) How do you make the files submitted by volunteers be "living documents" rather than "write once" documents -- which other volunteers can pick up and improve on in the future without having to go back to original scans and rework the work "from scratch" ? 3) How do you support as best as possible various output rendering file formats most appropriate for various reader devices? -- of which PG *already* "officially" recognizes literally about 80 different output file formats of differing complexities! From Bowerbird at aol.com Tue Sep 15 19:05:25 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Sep 2009 22:05:25 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: jim, i don't think you're saying anything that hasn't already been said here before. many times. over the course of years. i just think you're saying it less clearly... but hey, if anyone thinks jim _is_ saying something that can use some attention, please do tell us just exactly what it is... thanks. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From gbuchana at teksavvy.com Tue Sep 15 20:14:42 2009 From: gbuchana at teksavvy.com (Gardner Buchanan) Date: Tue, 15 Sep 2009 23:14:42 -0400 Subject: [gutvol-d] Re: z.m.l. 
can do what you want In-Reply-To: References: Message-ID: <4AB05822.2020906@teksavvy.com> Bowerbird at aol.com wrote: > z.m.l. can do what you want. > I know BB talks a good deal about ZML and it sounds pretty cool, but I've googled till my fingers bleed and I cannot tell for sure what exact ZML he has in mind. This ZML? http://rx4rdf.liminalzone.org/ZMLMarkupRules Or this one? http://sourceforge.net/projects/zeitung-ml/ Or this? http://www.seas.gwu.edu/~bell/publications/zml-report.pdf Or this? http://nt-appn.comp.nus.edu.sg/fm/zml/ There's a lot to choose from. The first one includes something resembling a specification and a Sourceforge project, so I bet that's it. Right? ============================================================ Gardner Buchanan Ottawa, ON FreeBSD: Where you want to go. Today. From jimad at msn.com Tue Sep 15 19:49:13 2009 From: jimad at msn.com (Jim Adcock) Date: Tue, 15 Sep 2009 19:49:13 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AAF5A12.8040505@perathoner.de> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> Message-ID: > PG has an implementation of TEI. How does one learn more and/or access the "PG implementation of TEI." I have seen PG TEI which looks to me to add some tags to the base TEI ? Again, TEI P5 is 1350 pages, which is a lot more inaccessible to volunteers than anything describing HTML tags that I have seen! I'd say the DP tagging documentation is already painful enough for most of us. I am about 100 pages into the TEI documentation, so in maybe two weeks I can tell you more about what I think about it.... From jimad at msn.com Tue Sep 15 19:56:07 2009 From: jimad at msn.com (Jim Adcock) Date: Tue, 15 Sep 2009 19:56:07 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AAF5C48.1040201@perathoner.de> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> <4AAF5C48.1040201@perathoner.de> Message-ID: As an example, I just tried auto-magically unwrapping some PG txt because I don't like the char count per line choices forced by PG and the assumed size of the txt display that PG assumes -- which is NOT the size of MY txt display. This is what then ended up being displayed on MY choice of txt display, once I applied the txt unwrapping algorithm: Ham. To be, or not to be,--that is the question:-- Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune Or to take arms against a sea of troubles, And by opposing end them? --To die,--to sleep,--No more; and by a sleep to say we end The heartache, and the thousand natural shocks That flesh is heir to,--'tis a consummation Devoutly to be wish'd. To die,--to sleep;-- To sleep! perchance to dream:--ay, there's the rub; For in that sleep of death what dreams may come, When we have shuffled off this mortal coil, Must give us pause: there's the respect That makes calamity of so long life; For who would bear the whips and scorns of time, The oppressor's wrong, the proud man's contumely, The pangs of despis'd love, the law's delay, The insolence of office, and the spurns That patient merit of the unworthy takes, When he himself might his quietus make With a bare bodkin? 
who would these fardels bear, To grunt and sweat under a weary life, But that the dread of something after death,-- The undiscover'd country, from whose bourn No traveller returns, --puzzles the will, And makes us rather bear those ills we have Than fly to others that we know not of? Thus conscience does make cowards of us all; And thus the native hue of resolution Is sicklied o'er with the pale cast of thought; And enterprises of great pith and moment, With this regard, their currents turn awry, And lose the name of action.--Soft you now! The fair Ophelia! --Nymph, in thy orisons Be all my sins remember'd. Now maybe to some of you -- you consider this result to be a good thing, an acceptable thing, a thing that well-represents the considerable efforts of the PG volunteers. But personally, I do not think so. From jimad at msn.com Tue Sep 15 15:05:52 2009 From: jimad at msn.com (Jim Adcock) Date: Tue, 15 Sep 2009 15:05:52 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <6d99d1fd0909142047u5a192b4cv1220494663969e8a@mail.gmail.com> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <6d99d1fd0909142047u5a192b4cv1220494663969e8a@mail.gmail.com> Message-ID: >Those sculptors who choose to work in ice are rarely remembered well by later ages. Sculptors who work in iron and bronze can easily be remembered for several millennia. The choice is the artist's. Except when what we are talking about is transcribers scratching other artist's works into mud tablets with (at best) a pointy stick. From ajhaines at shaw.ca Tue Sep 15 22:02:10 2009 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Tue, 15 Sep 2009 22:02:10 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> Message-ID: <6BB9B881D0244685AE0AF5E602017D03@alp2400> http://pgtei.pglaf.org/ ----- Original Message ----- From: "Jim Adcock" To: "'Project Gutenberg Volunteer Discussion'" Sent: Tuesday, September 15, 2009 7:49 PM Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT >> PG has an implementation of TEI. > > How does one learn more and/or access the "PG implementation of TEI." I > have seen PG TEI which looks to me to add some tags to the base TEI ? > > Again, TEI P5 is 1350 pages, which is a lot more inaccessible to > volunteers > than anything describing HTML tags that I have seen! I'd say the DP > tagging > documentation is already painful enough for most of us. I am about 100 > pages into the TEI documentation, so in maybe two weeks I can tell you > more > about what I think about it.... 
> > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From ajhaines at shaw.ca Tue Sep 15 22:13:09 2009 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Tue, 15 Sep 2009 22:13:09 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> <4AAF5C48.1040201@perathoner.de> Message-ID: It's clearly stated in PG Volunteers' FAQ V.89 (http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_any_places_where_I_should_indent_text.3F) that if you want to prevent unwanted wrapping, lines that should not be wrapped should be indented a space or two. In PG's older etexts, that predated this standard, the technique was used only sporadically. However, whenever an older text is cleaned up and reposted, it *is* applied where necessary. ----- Original Message ----- From: "Jim Adcock" To: "'Project Gutenberg Volunteer Discussion'" Sent: Tuesday, September 15, 2009 7:56 PM Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT > As an example, I just tried auto-magically unwrapping some PG txt because > I > don't like the char count per line choices forced by PG and the assumed > size > of the txt display that PG assumes -- which is NOT the size of MY txt > display. This is what then ended up being displayed on MY choice of txt > display, once I applied the txt unwrapping algorithm: > > Ham. To be, or not to be,--that is the question:-- Whether 'tis > nobler in the mind to suffer The slings and arrows of outrageous > fortune Or to take arms against a sea of troubles, And by > opposing end them? --To die,--to sleep,--No more; and by a sleep > to say we end The heartache, and the thousand natural shocks That > flesh is heir to,--'tis a consummation Devoutly to be wish'd. To > die,--to sleep;-- To sleep! perchance to dream:--ay, there's the > rub; For in that sleep of death what dreams may come, When we have > shuffled off this mortal coil, Must give us pause: there's the > respect That makes calamity of so long life; For who would bear > the whips and scorns of time, The oppressor's wrong, the proud > man's contumely, The pangs of despis'd love, the law's delay, The > insolence of office, and the spurns That patient merit of the > unworthy takes, When he himself might his quietus make With a bare > bodkin? who would these fardels bear, To grunt and sweat under > a weary life, But that the dread of something after death,-- > The undiscover'd country, from whose bourn No traveller returns, > --puzzles the will, And makes us rather bear those ills we have > Than fly to others that we know not of? Thus conscience does make > cowards of us all; And thus the native hue of resolution Is > sicklied o'er with the pale cast of thought; And enterprises of > great pith and moment, With this regard, their currents turn awry, > And lose the name of action.--Soft you now! The fair Ophelia! > --Nymph, in thy orisons Be all my sins remember'd. > > Now maybe to some of you -- you consider this result to be a good thing, > an > acceptable thing, a thing that well-represents the considerable efforts of > the PG volunteers. > > But personally, I do not think so. 
> > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d >
From jimad at msn.com Tue Sep 15 23:12:23 2009 From: jimad at msn.com (James Adcock) Date: Tue, 15 Sep 2009 23:12:23 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> Message-ID: As an example of how much author semantic information is lost going from an author's writing to PG txt format, I went and compared differences between a recent HTML and PG TXT I did -- where after doing the TXT encoding I went back and did three more passes over the images to add back in semantic differences to the HTML that the PG TXT didn't represent. Now the reality would be that it would take, say, TEI, not HTML, to represent all of the author's intent. But measuring the loss going from HTML back to TXT gives an order of magnitude estimate of how much author information we are throwing away by representing a work in PG TXT. In the case of this book, the answer was more than 1000 "losses" -- or an average of about 3 losses per page. And this is NOT counting an additional 1000 or so losses in the representation of emphasis. Now, let's say we have a PG TXT and some volunteer in the future wants to go back from that txt and, say, as correctly as possible represent that text using PDF. How many "errors" does that volunteer need to correctly find, where the TXT file loses the author's semantic information, by carefully comparing the page images to the PG TXT file, reintroducing information known to the original volunteer transcribers but discarded as not being representable in PG TXT? The answer is that this volunteer has to find and fix the txt in literally about 2000 places. Want to place a bet on how many of those 2000 places the volunteer trying to create an accurate PDF file is actually going to "catch"??? I can tell you in my efforts going from PG TXT to HTML in the first place it's a good part of a week's work -- not to imply *I* caught them all either!
From sankarrukku at gmail.com Wed Sep 16 00:41:24 2009 From: sankarrukku at gmail.com (Sankar Viswanathan) Date: Wed, 16 Sep 2009 13:11:24 +0530 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> Message-ID: The PG texts are produced by Volunteers -- individual producers and the post processors of D.P. A text file is only the minimum requirement stipulated by PG. It is up to the independent producers and the post-processors of DP to decide in what formats the book should be submitted. PG has no control over the format submitted. The White Washers check the files and post them. TEI is not popular either with the independent producers or the post processors. We could discuss this till the cows come home. But the solution is in the hands of the independent producers and post processors of DP.
On Wed, Sep 16, 2009 at 11:42 AM, James Adcock wrote: > As an example of how much author semantic information is lost going from an > author's writing to PG txt format, I went and compared differences between > a > recent HTML and PG TXT I did -- where after doing the TXT encoding I went > back and did three more passes over the images to add back in semantic > differences to the HTML that the PG TXT didn't represent. > > Now the reality would be that it would take say TEI not HTML to represent > all of the author's intent. But measuring the loss going from HTML back to > TXT gives an order of magnitude estimate of how much author information we > are throwing away by representing a work in PG TXT. In the case of this > book, the answer was more than 1000 "losses" -- or an average of about 3 > losses per page. And this is NOT counting about an addition 1000 losses in > representation of emphasis. > > Now, let's say we have a PG TXT and some volunteer in the future wants to > go > back from that txt and say as correctly as possible represent that text > using PDF. How many "errors" does that volunteer need to correctly find > where the TXT file loses author's semantic information by carefully > comparing the page images to the PG TXT file, reintroducing information > known to the original volunteer transcribers, but discarded as not being > representable in PG TXT? The answer is that this volunteer has to find and > fix the txt in literally about 2000 places. Want to place a bet on how > many > of those 2000 places the volunteer trying to create an accurate PDF file is > actually going to "catch" ??? I can tell you in my efforts going from PG > TXT to HTML in the first place it's a good part of a week's work -- not to > imply *I* caught them all either! > > > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -- Sankar Service to Humanity is Service to God -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Sep 16 01:09:11 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Sep 2009 04:09:11 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: jim said: > Again, TEI P5 is 1350 pages, which is a lot more inaccessible > to volunteers than anything describing HTML tags that I have seen!? > I'd say the DP tagging documentation is already painful enough > for most of us.? I am about 100 pages into the TEI documentation, so > in maybe two weeks I can tell you more about what I think about it... wow, jim looks to be a bit masochistic. could be a perfect candidate. :+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Sep 16 01:22:44 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Sep 2009 04:22:44 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: jim said: > Now maybe to some of you -- you consider this result > to be a good thing, an acceptable thing, a thing that > well-represents the considerable efforts of the PG volunteers. i doubt there is anyone who thinks that. what you ended up with is pure shit. that's because you did it wrong. you rewrapped lines that weren't supposed to be rewrapped. if you would have done it correctly, it would've come out right. but you did it wrong. 
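For readers following the technical point here: a rewrapper that honours the convention Al quoted from FAQ V.89 -- leave any line that begins with whitespace alone, and only reflow flush-left prose -- would not have mangled the soliloquy. A minimal sketch, assuming the input actually follows the indent-to-protect convention and that blank lines separate paragraphs:

    import textwrap

    def rewrap(text, width=72):
        """Reflow flush-left prose; leave indented (no-wrap) lines untouched."""
        out, prose = [], []

        def flush():
            if prose:
                out.extend(textwrap.wrap(" ".join(prose), width))
                prose.clear()

        for line in text.splitlines():
            if not line.strip():            # blank line: paragraph break
                flush()
                out.append("")
            elif line[:1] in (" ", "\t"):   # indented: protected, copy as-is
                flush()
                out.append(line)
            else:                           # flush-left prose: joinable
                prose.append(line.strip())
        flush()
        return "\n".join(out)

On a text that really uses the convention, verse and tables come through untouched and only the prose is reflowed; on one that does not (as in the file unwrapped above), no rewrapper can tell verse from prose, which is the point both Jim and the FAQ are making.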
now, your point is probably that we should make it easier for our users to do it correctly. and nobody will disagree... we _should_ make it easier for our users to do it correctly. and there's an (awfully) easy way to make it easier, which is to mark all lines that should not be rewrapped with leading spaces. but the whitewashers won't do it. why won't the whitewashers do it? i dunno. you'll have to ask them. i've certainly asked them. i've asked them to do it, pretty please. i've asked them again to do it, pretty please. i've asked 'em why they haven't done it. i've said, repeatedly, that i think it's stupid they haven't done it. and they still don't do it. not all the time, anyway. they do it some of the time. i consider that a very slight victory. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Sep 16 01:32:55 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Sep 2009 04:32:55 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: jim said: > In the case of this book, the answer was more than 1000 "losses" -- > or an average of about 3 losses per page.? And this is NOT counting > about an addition 1000 losses in representation of emphasis. what was the book? i'd like to compare the versions myself. the question is, "why are you having _any_ losses in the .txt files?" the answer, i am sure, will once again be, "you're doing it wrong". it sucks, yeah, but it will be important to fix your broken workflow. seriously, if you are stripping out emphasis, you're making a mistake. (unless the "emphasis" had no meaning, and was simply ornate decor.) in the meantime, if you do things intentionally that harm the .txt file, then yes, the .txt file is going to seem awfully incapable to you, so it's not a big surprise that you keep wanting to insist that such is the case. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Wed Sep 16 02:03:08 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 16 Sep 2009 11:03:08 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> Message-ID: <4AB0A9CC.4020608@perathoner.de> Jim Adcock wrote: > How does one learn more and/or access the "PG implementation of TEI." I > have seen PG TEI which looks to me to add some tags to the base TEI ? Adds some very few, restricts some others, specifies the usage of the rend attribute. > Again, TEI P5 is 1350 pages, which is a lot more inaccessible to volunteers > than anything describing HTML tags that I have seen! I'd say the DP tagging > documentation is already painful enough for most of us. I am about 100 > pages into the TEI documentation, so in maybe two weeks I can tell you more > about what I think about it.... Don't read the full TEI Guidelines. Read about TEI-Lite: http://www.tei-c.org/release/doc/tei-p5-exemplars/html/teilite.doc.html which is a lot shorter than the HTML4 specs. You don't even have to use all of TEI Lite. -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Wed Sep 16 02:14:24 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Sep 2009 05:14:24 EDT Subject: [gutvol-d] Re: z.m.l. 
can do what you want Message-ID: gardner said: > I know BB talks a good deal about ZML > and it sounds pretty cool wow, that almost sounds like it could be a compliment. > but I've googled till my fingers bleed and > I cannot tell for sure what exact ZML he has in mind. oh, you just should have asked me, gardner. google can't find everything, especially if something ain't all that concerned with being found by google... on the other hand, it's pretty easy to remember it: > http://z-m-l.com and oh yeah, that stands for "zen markup language". *** at the site, click "see some examples." which takes you to: > http://www.z-m-l.com/go/vl3.pl you've come to a page that has filename links on the left, and then a respective button for each one of those files... let's say you click the link for the top file -- test-suite.zml: > http://www.z-m-l.com/go/test-suite.zml make sure to click the text link -- don't click the button yet. the file you're viewing -- the test-suite for project gutenberg -- has been "marked up" with z.m.l. -- zen markup language. (z.m.l. is "zen", so you won't actually "see" any markup, at least not any anglebrackets with tags inside of them. but you'll see that it's formatted regularly; that's z.m.l.) i've appended the topmost lines from that file -- the cover page and some lines from the contents -- to this post. once again, this is the .zml file... *** now click the "back" button to go back to this page: > http://www.z-m-l.com/go/vl3.pl this time, click on the "test-suite" button on the right side. this will perform the conversion of the .zml file into .html, and take you to the resultant .html file right on the web... text colors are used to signify different structural elements. you can save this .html to your own machine to examine it. or put it side-by-side with the .zml file to see the conversion. this .html, with its c.s.s., certainly won't be your cup-of-tea, but changing the c.s.s. is a mere matter of template editing. it validates as is, to .html 4.01, which might work for kindle, or might not, i haven't done any checking on that specifically, but again, we modify the conversion by editing the template. i pointed you to the test-suite first, because reading it will give you an introduction to the overall philosophy of it all... also, the second file is "the 11 rules of z.m.l.", which might also give you a good orientation, even if the file is quite old. (but as i'm reluctant to tie anything down, it's not outdated.) i can also turn out a mean .pdf, with that conversion routine, and since you can size the .pdf page (and all other variables, like font, fontsize, leading, margins, etc.) however you want, that ends up being quite usable on a fixed-size e-ink screen. so we have .html (important in mounting a web version), and .pdf (for those situations where it's the user-chosen solution), but the best part of all is that zml-viewers are easy to program. i've coded them in basic and perl, and python should be simple. it's very fundamental coding, so it'll be portable to any language. it also facilitates open-source efforts, because it's easy to hack... zml-viewer-programs turn the .zml file into a beautiful e-book, customized to the user's preference, and offering a wide variety of high-functionality capabilities that make a .zml file powerful. this combination of beauty and power make z.m.l. hard to beat. -bowerbird p.s. 
the top of the test-suite file goes like this: the test-suite for project gutenberg a document containing the full range of features found in project gutenberg e-texts by bowerbird intelligentleman greetings, earthling... this is an e-text brought to you by project gutenberg, a 35-year-old volunteer effort to put literature online. please see the web-site for news and information on usage conditions for e-texts, volunteering, and more... http://www.gutenberg.org table of contents the test-suite for project gutenberg table of contents dedication chapter 1 -- welcome aboard chapter 2 -- the sections of the book chapter 3 -- text "styling" -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Wed Sep 16 02:19:03 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 16 Sep 2009 11:19:03 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> <4AAF5C48.1040201@perathoner.de> Message-ID: <4AB0AD87.4050704@perathoner.de> Al Haines (shaw) wrote: > It's clearly stated in PG Volunteers' FAQ V.89 > (http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_any_places_where_I_should_indent_text.3F) > > that if you want to prevent unwanted wrapping, lines that should not be > wrapped should be indented a space or two. This `markup? does not distinguish between poetry and a block quote. A block quote should be indented *and* rewrapped. And the Rewrap Blues is only part of the problem ... Another formidable challenge is to recover the chapter headings and other headings to make them stand out and to build a TOC. -- Marcello Perathoner webmaster at gutenberg.org From sankarrukku at gmail.com Wed Sep 16 03:56:08 2009 From: sankarrukku at gmail.com (Sankar Viswanathan) Date: Wed, 16 Sep 2009 16:26:08 +0530 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AB0AD87.4050704@perathoner.de> References: <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> <4AAF5C48.1040201@perathoner.de> <4AB0AD87.4050704@perathoner.de> Message-ID: DP produces TEI text. But very few post processors take to the TEI route. Why? The software Guiguts automatically converts the formatted text to html. You need not know much about HTML. The html output only needs to be tweaked at times. Even that is not necessary in all cases. Even with this scenario there has been a reluctance on the part of many post processors to do a html version. DP does not insist on a html version. But most of the Project Managers do insist on a html version. Even then there are DP projects which are posted only in the text format. For TEI to become popular we need a software which would automatically convert the TEI text to a final TEI version. Is it possible? I saw a software here. How good is it? http://www.tei-c.org/Talks/Forli/2006/conversion.xml On Wed, Sep 16, 2009 at 2:49 PM, Marcello Perathoner wrote: > Al Haines (shaw) wrote: > > It's clearly stated in PG Volunteers' FAQ V.89 ( >> http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_any_places_where_I_should_indent_text.3F) >> >> that if you want to prevent unwanted wrapping, lines that should not be >> wrapped should be indented a space or two. >> > > This `markup? does not distinguish between poetry and a block quote. 
A > block quote should be indented *and* rewrapped. > > > And the Rewrap Blues is only part of the problem ... > > Another formidable challenge is to recover the chapter headings and other > headings to make them stand out and to build a TOC. > > > > -- > Marcello Perathoner > webmaster at gutenberg.org > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -- Sankar Service to Humanity is Service to God -------------- next part -------------- An HTML attachment was scrubbed... URL: From traverso at posso.dm.unipi.it Wed Sep 16 04:45:00 2009 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Wed, 16 Sep 2009 13:45:00 +0200 (CEST) Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: (message from Sankar Viswanathan on Wed, 16 Sep 2009 16:26:08 +0530) References: <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> <4AAF5C48.1040201@perathoner.de> <4AB0AD87.4050704@perathoner.de> Message-ID: <20090916114500.232CB10074@cardano.dm.unipi.it> People use guiguts because it produces HTML, but mainly because it includes gutcheck, aspell, wordcount routines, and integrates display of the text and of the image corresponding to the text cursor position. The only route to have more TEI submissions is to have a version of guiguts producing TEI instead of HTML. And of course improve the automatic conversion from TEI to HTML Carlo From marcello at perathoner.de Wed Sep 16 05:40:47 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 16 Sep 2009 14:40:47 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <20090916114500.232CB10074@cardano.dm.unipi.it> References: <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> <4AAF5C48.1040201@perathoner.de> <4AB0AD87.4050704@perathoner.de> <20090916114500.232CB10074@cardano.dm.unipi.it> Message-ID: <4AB0DCCF.2000205@perathoner.de> Carlo Traverso wrote: > People use guiguts because it produces HTML, but mainly because it > includes gutcheck, aspell, wordcount routines, and integrates display > of the text and of the image corresponding to the text cursor > position. The only route to have more TEI submissions is to have a > version of guiguts producing TEI instead of HTML. That should be trivial. > And of course improve the automatic conversion from TEI to HTML In my copius free time ... -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Wed Sep 16 08:27:45 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Sep 2009 11:27:45 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: carlo said: > The only route to have more TEI submissions is > to have a version of guiguts producing TEI instead of HTML. yeah, right. extend the rickety workflow right out the window. that's the ticket. that will make everything all better, for sure... and thus again we learn to appreciate the open-source approach. > And of course improve the automatic conversion from TEI to HTML and the automatic .pdf conversion. and all the other conversions. or, you know, just use all the "standard" routines already out there, in this thoroughly-explored and well-documented standards arena. or hey, maybe once we get an e-book into .tei form, we can just stand back and admire it, and pat our backs on our achievement. who needs to convert to .html, when we can just bask in our glory? 
-bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Sep 16 08:55:17 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Sep 2009 11:55:17 EDT Subject: [gutvol-d] re: In search of a more-vanilla vanilla TXT Message-ID: jim said: > Which is a "good" submission format for submitting a > transcription of Shakespeare to be read on a cellphone, > PG txt format, or HTML? > Answer: Neither file format works worth a dang for > specifying Shakespeare to be read on cellphones. > Yet both file formats contain the lists of the words. again, jim, you've come up with the wrong answer... as i've said, repeatedly, the iphone app "eucalyptus" does a great job of rendering the p.g. e-texts in a way that makes them quite beautiful, according to reviewers, and i agree, for the most part. (it ends up eucalyptus is kind of flawed as an e-book program when it comes to some capabilities that i consider vital, such as _search_. but in terms of rendering the pages, it does that nicely.) format wonks think the format needs to describe beauty... it's far better, however, for a viewer-application to elicit it. because in the end, everything really depends on the viewer. > Here's a brief excerpt from Dove: > "'Go?'" he wondered. "Go when, go where?" well, that's on page 227 of this version in google. > "And proceed to my business under your eyes?" > "Oh dear no -- we shall go." > "'Go?'" he wondered. "Go when, go where?" > " In a day or two -- straight home. Aunt Maud wishes it now." http://books.google.com/books?id=B9AOAAAAIAAJ&client=safari&pg=PA227& ci=89%2C537%2C776%2C177&source=bookclip" http://books.google.com/books?id=B9AOAAAAIAAJ&pg=PA227&img=1&zoom=3&hl=en& sig=ACfU3U0L1tXMRlU83MbayfHNmmTl27CrXw&ci=89%2C537%2C776%2C177&edge=0"/ but i don't see anything special about that, anything that would be missed or lost in the plain-text version. > And another one: > She particularly likes you. ok, that's on page 18. the "you" is italicized, so you should have put underscores around it, so the viewer knows it's emphasized. it's also embedded in dialog, so the way you have pulled it out -- as if it was a paragraph by itself -- is rather misleading here. again, if you purposely disfigure the plain-text version, then yes, it will be inferior. your solution is to stop disfiguring it... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Wed Sep 16 09:21:52 2009 From: prosfilaes at gmail.com (David Starner) Date: Wed, 16 Sep 2009 12:21:52 -0400 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AAF5A12.8040505@perathoner.de> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> Message-ID: <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> On Tue, Sep 15, 2009 at 5:10 AM, Marcello Perathoner wrote: > Of course you have to learn TEI, as you had to learn HTML. No difference > there. But we know HTML, for one. We also have tools that help us with HTML, for two. For three and the strike-out, I have a host of tools that will help me edit, verify and view HTML, but there is no Debian packages for PGTEI. Yes, yes, if I want to spend my hours mucking around with stuff, I can in theory get it all installed. > PG has an implementation of TEI. 
I know you don't like it because you > haven't figured out how to produce pretty title pages. Note: by "pretty title pages" Marcello means a title page that looks like any title page in an actual book. Once again, I grabbed the nearest books; I have ten books, by ten different publishers, including two in Esperanto and one in a mixture of Esperanto and Chinese, and with the exception of one of the English books which right-justifies its title page, they all follow the basic format of centered pages, title (new line) author (bottom of page) publisher. None of them look a darn thing like the title pages PGTEI prints out. > and has a full suite of tools: > > http://wiki.tei-c.org/index.php/Category:Tools I see "To install the filter(s), start Open Office and follow the Tools / XML Filter Settings menu. Choose Open Package and locate the .jar file(s)." Again, no difference at all from stuff that comes preinstalled. > 3. Worthless to the end user ... > > TEI is a master format. Its use is in producing formats suitable for > end-user consumption. Then prove it. If I saw a single document produced from PGTEI that was suitable for end-user consumption, I might support it. Look damnit, I was a fan of TEI until I realized that the people who were going to bring it to PG didn't give a damn about making the output something we wanted people to see. > And if we don't equate end user == reader but try: end user == librarian or > end user == linguistic researcher we find that TEI is many times as useful > as HTML. The librarian is never the end-user. The librarian is the person who makes it available to the end-user. Nobody around here cares about the linguistic researcher as the end user, and we will never produce files that are marked up with the type of information--like distinguishing sentence ending punctuation from the same punctuation used other ways--that they need. The end user we're targeting is the reader. > 4. No decent output ... > > `Decentness' is a matter of debate. Which is why you blow at selling this. Until you accept that PGTEI needs to produce output that meets the standards of the people you're trying to sell it to, nobody cares. > At DP some PPers think it is essential to use every CSS feature at least > once in every text, having pictures float right and left and text flowing > around them and having illuminated dropcaps and printers ornaments and page > numbers all over the place. PGTEI cannot (yet) do that. > > I very much prefer a simple layout, with only essential pictures smack in > the middle of the text flow at the point they logically belong. A formatting > that is easily ported to all existing devices. PGTEI excels at this. Yes, in fact, some PPers do want to produce an etext that replicates the original, includes the important illustrated dropcaps (that are frequently as much a part of the illustration of the book as any other illustration) and page numbers (that are crucial for much of the non-fiction that we reproduce, especially if you want to follow the web of references from one PG era book to another.) > Ironically `decent' DP output is already falling to pieces on ePub devices > (not even to mention Mobipocket) because ePub does not support CSS position: > absolute. And if you had produced TEI output that could do what people wanted to do, it's possible that we would have better output on the ePub devices.
Right now, I would be surprised to find that PGTEI can output at all to ePub, and I wouldn't be surprised if the people who produced the DP output were happier with the results of their HTML translated to ePub than your HTML translated to ePub. -- Kie ekzistas vivo, ekzistas espero. From prosfilaes at gmail.com Wed Sep 16 09:25:04 2009 From: prosfilaes at gmail.com (David Starner) Date: Wed, 16 Sep 2009 12:25:04 -0400 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <6d99d1fd0909142047u5a192b4cv1220494663969e8a@mail.gmail.com> Message-ID: <6d99d1fd0909160925l40a2da13t3d5f0d4ffea3227@mail.gmail.com> On Tue, Sep 15, 2009 at 6:05 PM, Jim Adcock wrote: >>Those sculptors who choose to work in ice are rarely remembered well > by later ages. Sculptors who work in iron and bronze can easily be > remembered for several millennia. The choice is the artist's. > > Except when what we are talking about is transcribers scratching other > artist's works into mud tablets with (at best) a pointy stick. Which is exactly what happened to Gilgamesh. I suppose the author should have thrown a temper tantrum and demanded it be written only on the finest silk, in which case we wouldn't have a copy. -- Kie ekzistas vivo, ekzistas espero. From prosfilaes at gmail.com Wed Sep 16 09:26:27 2009 From: prosfilaes at gmail.com (David Starner) Date: Wed, 16 Sep 2009 12:26:27 -0400 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AB0DCCF.2000205@perathoner.de> References: <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> <4AAF5C48.1040201@perathoner.de> <4AB0AD87.4050704@perathoner.de> <20090916114500.232CB10074@cardano.dm.unipi.it> <4AB0DCCF.2000205@perathoner.de> Message-ID: <6d99d1fd0909160926u639042d3mb908f5e44dff1453@mail.gmail.com> On Wed, Sep 16, 2009 at 8:40 AM, Marcello Perathoner wrote: > Carlo Traverso wrote: >> And of course improve the automatic conversion from TEI to HTML > > In my copius free time ... Stop ranting about what others are doing in their free time, then. -- Kie ekzistas vivo, ekzistas espero. From jimad at msn.com Wed Sep 16 09:31:07 2009 From: jimad at msn.com (James Adcock) Date: Wed, 16 Sep 2009 09:31:07 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> Message-ID: >PG has no control over the format submitted. Nonsense. I have tried submitting ?other things? and have been told repeatedly that the ?minimum requirements in practice of PG? is that a TXT and an HTML file be submitted, and that these two files pass through a large number of fitness tests required by PG, which in practice includes restrictions on the choice of char sets used in the internal rep of the TXT and of the HTML file. So, in fact PG DOES have control over the format submitted, and the way PG asserts that control is by refusing to accept submission of formats and details of those formats that they choose not to support. As a simple counter-example of the above claim ?PG has no control over the format submitted? note that personally I would much rather be submitting TXT files which do not correspond to the PG requirements of including a gratuitous line wrap every 72 chars. 
Or if I am required to submit TXT files with line wraps I would much prefer to retain the line wraps of the original text, because it is a royal pain for some future volunteer to have to "fix" the position of line wraps back to the original text in order to do additional processing of the text file in the future, for example because they want to find and include additional semantic information that can be found in the original page scans, but not in the TXT. And in practice it is impossible to do this visual analysis unless one matches line breaks to the original page scans -- as DP well knows. Another example from a couple years ago is I asked PG how I could submit MOBI formatted texts of books they already had in other formats. I was told that I was not allowed to do so. So I set up an independent website to distribute PG books in MOBI format to my friends in the MOBI community -- retaining the PG licenses and legalese conditions. Now, as hoped for, some years later PG has decided to support MOBI after all -- at least to some extent. But: what a pain! Why is this important to me? Well, I happen to like classes of reader machines that the internal mechanizers of PG do not like. PG likes big teletype like display machines, capable of displaying more than 72 chars per line. [Your standard PC or Mac still remains fundamentally a teletype emulator] And PG likes tiny machines with extremely limited displays, also known as cell phones. I personally do not like either of those classes of machines, but rather machines that are middle sized -- small enough that I can pick them up and easily read them while lying in bed late at night for example. But large enough that I can understand in context the ebb-and-flow of what the author wrote in some surrounding context. With these middle-sized machines issues of text reflow become a central issue in the pleasure (or lack thereof) of being able to use the machine. And yes there are quite a number of tools one can use to help "fix" at least partially "broken" texts re these machines, including Calibre and say Mobipocket Creator. But I'd rather not have to "fix" a text each time before I can read it. And I'd rather it not be "broken" in the first place. So, in summary, as a "volunteer" am I free to do what I want? Yes, certainly -- but not if I want any of my efforts to ever show up on any PG website! As Bowerbird is only too happy to point out: "Please feel free to go somewhere else!" -------------- next part -------------- An HTML attachment was scrubbed... URL:
From jimad at msn.com Wed Sep 16 09:34:32 2009 From: jimad at msn.com (James Adcock) Date: Wed, 16 Sep 2009 09:34:32 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: Well, therein lies the problem. *I* [or rather I mean to say _I_] am masochistic enough to take on TEI, but none of the other volunteers are willing to join in with me. >wow, jim looks to be a bit masochistic. could be a perfect candidate. :+) -------------- next part -------------- An HTML attachment was scrubbed... URL:
From jimad at msn.com Wed Sep 16 09:56:15 2009 From: jimad at msn.com (James Adcock) Date: Wed, 16 Sep 2009 09:56:15 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: >you rewrapped lines that weren't supposed to be rewrapped. You make my point for me.
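(a sketch, in Python, of the kind of width filter requested later in this message -- assuming unwrapped input with one paragraph per line; the 72- and 20-column figures are just the examples used in this thread, and the function is illustrative rather than an existing PG filter:)

    import textwrap

    def break_lines(text, width):
        # re-insert line breaks into unwrapped text (one paragraph per line)
        # at whatever column a given display needs
        return "\n".join(
            textwrap.fill(paragraph, width) if paragraph.strip() else paragraph
            for paragraph in text.split("\n")
        )

    # break_lines(book, 72)  for a wide, teletype-style display
    # break_lines(book, 20)  for a small cellphone screen
    # devices that reflow text themselves can skip the filter entirely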
When one relies on automagical tools to try to recreate semantic information discarded by the PG TXT representation, more or less often one ends up with something that looks like sh*t - your word not mine. When the results break one is told "Oh you did it wrong, you should have done something else instead." Yes of course, but once one relies on human intervention to "fix" the problem when a particular algorithm breaks, then one does not have an automatic algorithm. Ultimately what one should do if one wants to "get it right" is to abandon attempts at automagical tools which work sometimes and end up looking like sh*t other times and instead take the PG TXT file, take the original page scans, look at the page scans to figure out where the PG TXT files gratuitously entered line breaks where the author didn't intend line breaks, and take them back out. After the gratuitous page breaks are taken back out (the work of a few days - trust me on this!) then one can either, if one has a machine, such as a teletype, incapable of reflow, run the now gratuitous-line-break free TXT back through a simple unambiguous algorithm to insert a line break at the appropriate point for your machine - at a whitespace prior to char72 if you own a teletypewriter, at a whitespace prior to char20 perhaps if you own a cellphone. Or better, if you have a more modern machine, which really, I think most of us DO have, a machine capable of calculating reflow itself aka "word wrap" then you just feed the machine the TXT that doesn't have the gratuitous line breaks and everything works automagically. Assuming one is willing to live with ragged right. Or tolerate slightly ugly word spacing on machines that force right justify (sigh.) Better yet, we should ask our technologist friends to include not only reflow but also automatic hyphenation routines in our machines. Is it too much, for example, to ask PG to provide the option to the rare user who actually WANTS line breaks at char 72, or for that matter actually wants line breaks at char 20, is it too much to ask PG to provide a filter to insert such "gratuitous" line breaks? Consider: PG *already* provides literally 40 different such filter programs to help people with various strange obscure legacy machines. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Wed Sep 16 10:13:40 2009 From: jimad at msn.com (James Adcock) Date: Wed, 16 Sep 2009 10:13:40 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: >what was the book? i'd like to compare the versions myself. The book is E-text #29452 And contrary to your complaints I didn't "strip out anything." A more accurate statement of your complaint is that I didn't waste a whole lot of my time and energy inventing and manually inserting semantic markings in a legacy file format that is a hopelessly broken representation of this book in the first place. If you want to hack up the TXT file some way to make you more happy feel free to do so - you're a "volunteer" too - I'm certainly not going to be reading the TXT file, so personally I don't care what you do! -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richfield at telkomsa.net Wed Sep 16 09:26:09 2009 From: richfield at telkomsa.net (Jon Richfield) Date: Wed, 16 Sep 2009 18:26:09 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> Message-ID: <4AB111A1.2050703@telkomsa.net> I have to agree with large parts of what James Adcock says. A lot of it depends on the medium (media in fact), the message, and so on. When I write (without any interest in whether I should be writing or not, or whether anyone cares) there is considerable rumination, not to mention bellyaching, about punctuation, font, typeface, formatting and so on, in fact practically anything that could be done in more than one way. The fundamental thing is information. Alternative ways of representing the information require information to convey them, and offer opportunities for conveying the information. Well-conveyed information is in that respect at least, beautiful. The reason that most of my presentations are fairly spare is that most of what I have to say is fairly directly factual. Conspicuous headings, distinct tables of contents, and clear meanings are usually enough for my purposes because I am no artist. The reason I struggle with punctuation is that I have my own rules, and bugger the grammarians. My rules are: If the punctuation doesn't matter, leave it out. If it changes the meaning, it does matter. Put it in. If it does not really change the meaning, but the reader needs to read something twice to make sense of it, adapt the punctuation, the sentence structure, or even the wording, to provide unconscious, one-pass parsing. If omitting (or inserting) logically unnecessary punctuation is likely to distract or confuse the reader, then don't or do, as the case might be. Know something about common conventions and their significances so that you have some idea of what to flaunt and what to flout. That about somes it up, sum of it anyway. If that is how simple it is, why is it so complicated? Because I am lousy at noticing when I violate those rules. Many people are not that that is more of an excuse than an explanation. Recently I helped I think a friend with a book that he had written in German and translated into English. The book was a straightforward work of philosophy, so it should have been easy. Unfortunately, though he is literate and intelligent, he had absent-mindedly retained a lot of the German commas. It rendered reading of the book such hard work that I could not read it in bulk. I was doing double-takes every few sentences, which was more than was needed to ruin my concentration and wreck my attempts to remain coherently aware of the thread of significance. A sign of mine being a lesser intellect according to Whitehead? Definitely, but remember not only that the average intellect is less than lesser, but what is worse, it is less lesser than half the population. The lesser is who you are writing for more or lesser always. Consider "It is a long tail, certainly,' said Alice, looking down with wonder at the Mouse's tail; 'but why do you call it sad?' And she kept on puzzling about it while the Mouse was speaking, so that her idea of the tale was something like this: 'Fury said to a mouse, That he met in the house, "Let us both go to law: I will prosecute you. --Come, I'll take no denial; We must have a trial: For really this morning I've nothing to do." 
Said the mouse to the cur, "Such a trial, dear Sir, With no jury or judge, would be wasting our breath." "I'll be judge, I'll be jury," Said cunning old Fury: "I'll try the whole cause, and condemn you to death."' 'You are not attending!' said the Mouse to Alice severely. 'What are you thinking of?' 'I beg your pardon,' said Alice very humbly: 'you had got to the fifth bend, I think?' 'I had NOT!' cried the Mouse, sharply and very angrily." Then again, pace archy, how about something like: "Wenn hinter fliegen fliegen fliegen fliegen fliegen fliegen hinternach" or "smith who when jones had had had had had had had had had had had the judgement of the examiners in his favour" Or which would fit the writer's intention better: "You would be the lad for that." or "You would be the lad for that." or "You would be the lad for that." How many ways with more or less distinct meanings could one place the emphasis in "Two twenty-buck tickets for her show I should buy"? If anyone does not believe that punctuation matters, try reading "Eats shoots and leaves" by Lynne Truss. (If you haven't read it anyway, do yourself a favour and read it anyway.) Now all that is great fun, compared to waiting for Godot with a hangover in a hot public lavatory at the terminus of a diesel trucking company in Houston, but if you actually wish to write (or convey someone else's writing) with efficiency and with respect for the information, the author, and the reader, then you will use all the channels of information that the medium (media, funiculi, funicula ) that assist without increasing the noise to signal ratio. The fact that some authors don't need or want it is irrelevant. The right amount is what works best, and if he wants nothing, that is the right amount. It does not follow that it is the right amount elsewhere. The fact that your reader can get no end of fun out of Joyce without punctuation, does not mean that the same must apply for figurate verse or calligraphic works. The medium is rarely the message unless the message is about the medium, unless you are in one of the bottom-feeding niches, or a great artist, but to gird at more powerful notations because less can be made to do, for some people, mostly, with some exertion, is poorly persuasive, let alone cogent. I never did like Gertrude Stein. I don't know when where or how anyone will come up with the generally universally and perfect notation. (I know when I think they will, but that is another story.) All I ask is that they please make it something that can be read with a vanilla text reader, no instruction manual, and some patience, even if the proper markup interpreter on a great audiovisual system or tiny cellphone can give a mind-blasting performance. Personally I would like it to start with the vanilla text and punctuation and have the markups follow as an appendix, to be ignored when unwanted or not understood. Patience upon a rock Smiling at grief Because she is wearing ear plugs. (Of course?) In my case I am privileged because I am not dependent on pure txt. If necessary I can convert PDF, though I have never learned its internal format. Cheers, Jon From jimad at msn.com Wed Sep 16 10:27:48 2009 From: jimad at msn.com (James Adcock) Date: Wed, 16 Sep 2009 10:27:48 -0700 Subject: [gutvol-d] Re: z.m.l. 
can do what you want In-Reply-To: References: Message-ID: > http://z-m-l.com Well, if "the proof is in the pudding" then I invite the other readers to compare the z.m.l results for "Scrooge" at: http://www.z-m-l.com/go/vlconvert.html to the PG TEI results for "My Antonia" at: http://www.gutenberg.org/files/19810/19810-pdf.pdf However, I will point out that neither markup has the proper support necessary to render correct EPUB nor MOBI file formats. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Wed Sep 16 10:37:40 2009 From: jimad at msn.com (James Adcock) Date: Wed, 16 Sep 2009 10:37:40 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: So the "solution" is that all readers should buy an iphone and run "eucalyptus" I can't disagree with that - it IS relatively trivial to render attractive text if one knows one is rendering to only one particular machine. Help me out, now tell us readers WHICH cell phone company we ought to switch to? -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Wed Sep 16 10:56:34 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 16 Sep 2009 19:56:34 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> Message-ID: <4AB126D2.2050706@perathoner.de> David Starner wrote: > But we know HTML, for one. We also have tools that help us with HTML, > for two. For three and the strike-out, I have a host of tools that > will help me edit, verify and view HTML, but there is no Debian > packages for PGTEI. Where's the debian package for guiguts? I had to actually edit the code to make it run on my debian/unstable. nxml-mode in emacs is all you'll ever need to edit and validate xml. Or use the TEI stylesheets in OpenOffice, if you must needs have WYSIWYG. Sheesh! >> PG has an implementation of TEI. I know you don't like it because you >> haven't figured out how to produce pretty title pages. > > Note: by "pretty title pages" Marcello means a title page that looks > like any title page in an actual book. Once again, I grabbed the > nearest books; I have ten books, by ten different publishers, > including two in Esperanto and one in a mixture of Esperanto and > Chinese, and with the exception of one of the English books which > right-justifies its title page, they all follow the basic format of > centered pages, title (new line) author (bottom of page) publisher. > None of them look a darn thing like the title pages PGTEI prints out. Ohh. Pleeeeease! Go here: http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-pdf.pdf and tell me what you don't like about the title page. And then go here: http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-h.html to verify that it looks the same in HTML. And then go here: http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-0.txt to see how it looks in TXT. All from ONE and the same TEI master. > If I saw a single document produced from PGTEI that was > suitable for end-user consumption, I might support it. http://www.gnutenberg.de/pgtei/0.5/examples/ > The librarian is never the end-user. 
The librarian is the person who > makes it available to the end-user. Nobody around here cares about the > linguistic researcher as the end user, and we will never produce files > that are marked up with the type of information--like distinguishing > sentence ending punctuation from the same punctuation used other > ways--that they need. The end user we're targeting is the reader. (Distinguishing punctuation is very important for typesetters.) YOU are targeting the reader that reads on a desktop browser. I am targeting everybody on every platform of every size and every software that might want to use or convert our books in any way imaginable or not yet imaginable. > Yes, in fact, some PPers do want to produce an etext that replicates > the original, includes the important illustrated dropcaps (that are > frequently as much a part of the illustration of the book as any other > illustration) and page numbers (that are crucial for much of the > non-fiction that we reproduce, especially if you want to follow the > web of references from one PG era book to another.) And while they are busy `replicating the original? they miss all opportunities of electronic text. Eg. the index entries are still linked to the *page* they reference, while it was technically possible for decades now to go directly to the word. So if the reader clicks on an indexed term, she must read all the page until she finds the reference instead of going directly to the reference (and maybe have it highlighted like on Wikipedia). This opportunity of making the books more accessible has been missed because DP is still producing electronic facsimiles instead of electronic books. Eg. speaker tagging. In a few years when everybody will have speech syntesis on their cell phones ebook readers people may want to listen to their books while driving. If you have quotes marked up you can assign different voices to different speakers. Eg. geografic tagging. While visiting someplace you may want to find all book references that refer to the place you are in. DP misses out again and again. But they make pretty facsimiles ... > And if you had produced TEI output that could do what people wanted to > do, it's possible that we would have better output on the ePub > devices. If people had started using TEI instead of griping endlessly about minor shortcomings, we might have now a complete TEI workflow in place. > Right now, I would be surprised to find that PGTEI can output > at all to ePub, and I wouldn't be surprised if the people who produced > the DP output were happier with the results of their HTML translated > to ePub than your HTML translated to ePub. PGTEI outputs just fine to ePub. Just take the HTML output and convert it in Calibre or whatever you are using. Look here (this is PDF, not ePub): http://www.gnutenberg.de/pgtei/0.5/examples/pgtei-pdf-sony-reader.jpg -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Wed Sep 16 11:31:13 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Sep 2009 14:31:13 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: jim said: > As Bowerbird is only too happy to point out: > ?Please feel free to go somewhere else!? ? jim, jim, jim. i'm just about the only person here who has any sympathy with what you are saying, and who is interacting with you in good faith, so why you be stabbin' me in da back like that? don't put words in my mouth. i never said anything like that. indeed, i have often advocated that p.g. 
e-texts should retain the linebreaks from the original, just like you've advocated, for the same reason. likewise, long before anyone else cared about it, i argued that lines which should not be rewrapped should be prefaced with one or more spaces, so they could easily be treated correctly during rewrap. i've also asked, repeatedly, for the plain-text files to include the names of the graphics, so that my viewer-apps could display them at the right place, which i assume is a practice you would also support. moreover, i am the only person here who has set up a website that people can use to unwrap the e-texts: > http://z-m-l.com/unwrap.pl so don't be raggin' on me, ok? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Sep 16 11:57:29 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Sep 2009 14:57:29 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: jim said: > You make my point for me. yes, i do, jim. and i make your point _much_better_ than you do, because i don't do the sabotage thing to it first. > When one relies on automagical tools to try to > recreate semantic information discarded by > the PG TXT representation, more or less often > one ends up with something that looks like sh*t ? > your word not mine. actually, i got the word from the dictionary, so you should feel free to use it without me. your tools are neither "auto" nor "magical" enough. it is only when you've improved them to the point that they cannot be improved any more that you earn the right to bitch about what others are doing. > When the results break one is told > ?Oh you did it wrong, you should > have done something else instead.? i didn't say "you did it wrong" as some kind of mystical power intended to make you go away. i said that you did it wrong because you did it wrong. it ends up that, with the right tool, and if you do it right, the p.g. e-text format works perfectly well, or at least it _can_, if the whitewashers really did what they say they do. and sometimes they do. for instance, if you would have taken your "hamlet" text from the newest version in the library, and put it into my unwrap site listed above, you'll see that it works just fine. so there is no _shortcoming_ of the p.g. plain-text format that needs to be "overcome". there are only some _flaws_ -- a portion of which seem to be intentionally inflicted -- which need to be corrected, so that the format can shine... > once one relies on human intervention to ?fix? the problem > when a particular algorithm breaks, then one does not have > an automatic algorithm. i agree. but that's not what is at issue here. > Ultimately what one should do if one wants to ?get it right? > is to abandon attempts at automagical tools which work > sometimes and end up looking like sh*t other times > and instead take the PG TXT file, take the original page scans, > look at the page scans to figure out where the PG TXT files > gratuitously entered line breaks where the author didn?t > intend line breaks, and take them back out. see, jim, here's where you get things half-right-but-kinda-wrong. you just haven't thought through these things well enough so that you can explain them clearly, so it comes out in this mumbo-jumbo. > After the gratuitous page breaks are taken back out > (the work of a few days ? trust me on this!) again, you're severely unclear here. 
(and please, please, please, if anyone thinks that jim _is_ being "clear", do jump in and say so and help provide an explanation.) > then one can either, if one has a machine, such as a teletype, > incapable of reflow, run the now gratuitous-line-break free > TXT back through a simple unambiguous algorithm to insert > a line break at the appropriate point for your machine ok, here's a relatively straightforward description of the process. but, really, jim, there's no need for it. we programmers _know_ how to do this. it's not difficult. the guy who coded "eucalyptus" did a fine job on doing this, and he is using the p.g. text-files, so they don't really present the insurmountable problem you think... > Or tolerate slightly ugly word spacing on machines that > force right justify (sigh.) Better yet, we should ask our > technologist friends to include not only reflow but also > automatic hyphenation routines in our machines. again, not to beat a dead horse, but eucalyptus does ragged-right or justification, whichever the user prefers, and hyphenation too, so everything you're asking for has already been done at least once. rather than harping about the format -- which does just peachy, thank you very much -- you need to complain about the coders who are not giving you the type of tools you would like to have... > Is it too much, for example, to ask PG to provide the option > to the rare user who actually WANTS line breaks at char 72, > or for that matter actually wants line breaks at char 20, > is it too much to ask PG to provide a filter to insert such > "gratuitous" line breaks? Consider: PG *already* provides > literally 40 different such filter programs to help people > with various strange obscure legacy machines. more sloppy thinking, jim. what do you really _mean_ when you say "provide the option" or "provide a filter"? i take it to mean "give your users a tool that does that". and if you started saying it that way, you'd come to realize that your beef is not with the p.g. text-file format at all, but rather the fact that p.g. isn't supplying users with the tools we need... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL:
From Bowerbird at aol.com Wed Sep 16 12:18:21 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Sep 2009 15:18:21 EDT Subject: [gutvol-d] Re: z.m.l. can do what you want Message-ID: jim, i'm losing patience with you, quickly, i warn you. *** jim said: > I invite the other readers to compare > the z.m.l. results for "Scrooge" at: > http://www.z-m-l.com/go/vlconvert.html first, the file at that address is volatile, depending on which .html conversion was last made via a button-click on the page which is located over at: > http://www.z-m-l.com/go/vl3.pl so, to ensure you've got "scrooge" at the address listed above, click on that button. (that'll be the "a christmas carol" button.) > I invite the other readers to compare > the z.m.l results for "Scrooge" at: > http://www.z-m-l.com/go/vlconvert.html > to the PG TEI results for "My Antonia" at: > http://www.gutenberg.org/files/19810/19810-pdf.pdf ok, do you understand that is comparing a converted .html file to a converted .pdf? if you want to compare, compare .html to .html, and .pdf to .pdf, because that makes some sense. oh, and by the way, picking "my antonia" was not a wise move, because i've done extensive research work on that particular digitization. so if you want me to onslaught it at you, let me know.
> However, I will point out that neither markup has > the proper support necessary to render > correct EPUB nor MOBI file formats. show me the "proper" markup to accomplish the mobi, and i will edit the template so you can get that markup. then have the .tei guys do what they would need to do. because i'd love to see the mobi they get from their .tei. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From gbnewby at pglaf.org Wed Sep 16 12:20:43 2009 From: gbnewby at pglaf.org (Greg Newby) Date: Wed, 16 Sep 2009 12:20:43 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AB126D2.2050706@perathoner.de> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> Message-ID: <20090916192042.GA9297@pglaf.org> > Go here: > > http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-pdf.pdf I'd like to support what Marcello wrote, below. I've long believed that having TEI be the native output from Distributed Proofreaders is desirable. My understanding is they just don't have the available person-power to implement this. When a TEI eBook is submitted to the whitewashers, we have a very nice processing stream (with the pieces mentioned below) to very easily produce .txt, .htm and anything else we might want. If we had enough of eBooks with TEI as the native format, we could add transformation options to www.gutenberg.org's catalog pages, to truly provide "your book, your way." There's no lack of ability to produce, transform or otherwise work with TEI files. As someone pointed out, the DP proofreading is essentially agnostic about the back-end encoding format. The postprocessors might see some variation in the workflow, but would not necessarily need to work directly with TEI markup. I think the existing software and examples are compelling. If there was an easier way of getting TEI embedded into the DP workflow, it would have happened by now. -- Greg On Wed, Sep 16, 2009 at 07:56:34PM +0200, Marcello Perathoner wrote: > David Starner wrote: > > >> But we know HTML, for one. We also have tools that help us with HTML, >> for two. For three and the strike-out, I have a host of tools that >> will help me edit, verify and view HTML, but there is no Debian >> packages for PGTEI. > > Where's the debian package for guiguts? I had to actually edit the code > to make it run on my debian/unstable. > > nxml-mode in emacs is all you'll ever need to edit and validate xml. > > Or use the TEI stylesheets in OpenOffice, if you must needs have > WYSIWYG. Sheesh! > > >>> PG has an implementation of TEI. I know you don't like it because you >>> haven't figured out how to produce pretty title pages. >> >> Note: by "pretty title pages" Marcello means a title page that looks >> like any title page in an actual book. Once again, I grabbed the >> nearest books; I have ten books, by ten different publishers, >> including two in Esperanto and one in a mixture of Esperanto and >> Chinese, and with the exception of one of the English books which >> right-justifies its title page, they all follow the basic format of >> centered pages, title (new line) author (bottom of page) publisher. >> None of them look a darn thing like the title pages PGTEI prints out. > > Ohh. Pleeeeease! 
> > Go here: > > http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-pdf.pdf > > and tell me what you don't like about the title page. > > And then go here: > > http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-h.html > > to verify that it looks the same in HTML. > > And then go here: > > http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-0.txt > > to see how it looks in TXT. > > > All from ONE and the same TEI master. > > >> If I saw a single document produced from PGTEI that was >> suitable for end-user consumption, I might support it. > > http://www.gnutenberg.de/pgtei/0.5/examples/ > > >> The librarian is never the end-user. The librarian is the person who >> makes it available to the end-user. Nobody around here cares about the >> linguistic researcher as the end user, and we will never produce files >> that are marked up with the type of information--like distinguishing >> sentence ending punctuation from the same punctuation used other >> ways--that they need. The end user we're targeting is the reader. > > (Distinguishing punctuation is very important for typesetters.) > > YOU are targeting the reader that reads on a desktop browser. > > I am targeting everybody on every platform of every size and every > software that might want to use or convert our books in any way > imaginable or not yet imaginable. > > >> Yes, in fact, some PPers do want to produce an etext that replicates >> the original, includes the important illustrated dropcaps (that are >> frequently as much a part of the illustration of the book as any other >> illustration) and page numbers (that are crucial for much of the >> non-fiction that we reproduce, especially if you want to follow the >> web of references from one PG era book to another.) > > And while they are busy `replicating the original? they miss all > opportunities of electronic text. > > > Eg. the index entries are still linked to the *page* they reference, > while it was technically possible for decades now to go directly to the > word. So if the reader clicks on an indexed term, she must read all the > page until she finds the reference instead of going directly to the > reference (and maybe have it highlighted like on Wikipedia). > > This opportunity of making the books more accessible has been missed > because DP is still producing electronic facsimiles instead of > electronic books. > > > Eg. speaker tagging. In a few years when everybody will have speech > syntesis on their cell phones ebook readers people may want to listen to > their books while driving. If you have quotes marked up you can assign > different voices to different speakers. > > > Eg. geografic tagging. While visiting someplace you may want to find all > book references that refer to the place you are in. > > > DP misses out again and again. > > > But they make pretty facsimiles ... > > >> And if you had produced TEI output that could do what people wanted to >> do, it's possible that we would have better output on the ePub >> devices. > > If people had started using TEI instead of griping endlessly about minor > shortcomings, we might have now a complete TEI workflow in place. > > >> Right now, I would be surprised to find that PGTEI can output >> at all to ePub, and I wouldn't be surprised if the people who produced >> the DP output were happier with the results of their HTML translated >> to ePub than your HTML translated to ePub. > > PGTEI outputs just fine to ePub. Just take the HTML output and convert > it in Calibre or whatever you are using. 
> > Look here (this is PDF, not ePub): > > http://www.gnutenberg.de/pgtei/0.5/examples/pgtei-pdf-sony-reader.jpg > > > -- > Marcello Perathoner > webmaster at gutenberg.org > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d From Bowerbird at aol.com Wed Sep 16 12:36:41 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Sep 2009 15:36:41 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: jim said: > So the ?solution? is that all readers should > buy an iphone and run ?eucalyptus? no. the "solution" is to understand that a viewer-program like eucalyptus can be programmed on _any_ platform, designed to take the project gutenberg plain-text files and display them in a beautiful (and powerful) manner. the "solution" is not to debunk that plain-text format -- which is what you seem to be wanting to do here -- and _certainly_ not invent a new cockamamie format, but rather to patch the small inconsistency problems that haunt the library so that the plain-text files are dependable and reliable in terms of delivering beauty. > I can?t disagree with that ? it IS relatively trivial > to render attractive text if one knows one is > rendering to only one particular machine. have you ever done any programming, jim? and, in particular, have you ever done e-book coding? even though eucalyptus is "rendering to only one particular machine", it allows the end-user to pick the font-size, which requires rewrapping the text. in addition, many apps let you switch to landscape, which means you must code for two screen-sizes... it's not as easy as you make it sound to make an app, even if it's just "for one particular machine". however, once you've made such an app, it's not that difficult to port it to another machine, or to another language, or to hack it for some specific purpose you only need today, or to modify it to fit your own personal preferences, or... but the important thing to remember as far as this thread is that the "vanilla" .txt format used by project gutenberg is extremely close to being totally sufficient as a file-format. it just needs to have a few ambiguous situations cleaned up, and then the whole library needs to undergo quality control because there are some rather glaring inconsistencies there. but we don't need a new format... never have... never will... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Sep 16 12:41:27 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Sep 2009 15:41:27 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: greg wants the distributed proofreaders to jump through all kinds of _unnecessarily_ difficult hoops, just so that project gutenberg can _supposedly_ get the benefits that i've already _proven_ can be obtained from the .txt format. greg went to library school. he's supposed to be smart. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Sep 16 12:54:32 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Sep 2009 15:54:32 EDT Subject: [gutvol-d] david and marcello Message-ID: both david starner and marcello are relegated to my spam folder, but it's nice to see them fighting. marcello has convinced himself that pgtei is the next big thing, and has been ever since 2001... unfortunately for mr. 
marcello, he hasn't persuaded d.p. people. they tried .tei, and ran screaming from the scene due to the stench. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From joyce.b.wilson at sbcglobal.net Wed Sep 16 13:30:52 2009 From: joyce.b.wilson at sbcglobal.net (Joyce Wilson) Date: Wed, 16 Sep 2009 15:30:52 -0500 Subject: [gutvol-d] Ebook 30000! And some sort of catalog milestone Message-ID: <4AB14AFC.9040603@sbcglobal.net> I saw today that ebook 30000 has been posted! Congratulations! And in other (more self-serving) good news, I see we have 14271 books with no subjects in their bib records and 12062 with no LoCC in their bib records, so I guess it's no longer the case that "more than half of our books don't have subject information added" and "more than one half of our books don't have LoC info" as the "Help on Bibliographic Record Page" says. : -) Joyce Wilson PG Cataloging Team From prosfilaes at gmail.com Wed Sep 16 22:47:03 2009 From: prosfilaes at gmail.com (David Starner) Date: Thu, 17 Sep 2009 01:47:03 -0400 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AB126D2.2050706@perathoner.de> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> Message-ID: <6d99d1fd0909162247o504cc5bcy86ac80593ec2d809@mail.gmail.com> On Wed, Sep 16, 2009 at 1:56 PM, Marcello Perathoner wrote: > Where's the debian package for guiguts? There's not one, but that's why it's called in-house code and we can talk to the programers if we need help. But there are several Debian packages for programs that can check and display HTML. >>> PG has an implementation of TEI. I know you don't like it because you >>> haven't figured out how to produce pretty title pages. > > Ohh. Pleeeeease! So you attack people for having made complaints that were perfectly valid when they were made? Unless you're going for the martyr award, I hardly see how that's productive. > Eg. the index entries are still linked to the *page* they reference, while > it was technically possible for decades now to go directly to the word. If they are still linked to the page instead of the word, it's because the PPer looked at a 50 page index and decided that there was no way they were going to wade through there and try and find where on the page the link was intended to go to for 20,000 references. HTML and TEI are no different here. > Eg. geografic tagging. While visiting someplace you may want to find all > book references that refer to the place you are in. Maybe. There's a very real question whether it's worth the man-power to mark this up, and it's really a bit of a gratuitous feature. > If people had started using TEI instead of griping endlessly about minor > shortcomings, we might have now a complete TEI workflow in place. If you had listened to the needs of the people who you wanted to start using TEI instead of bitching about them and their requirements, maybe they would have started using TEI. -- Kie ekzistas vivo, ekzistas espero. From schultzk at uni-trier.de Thu Sep 17 00:05:17 2009 From: schultzk at uni-trier.de (Keith J. 
Schultz) Date: Thu, 17 Sep 2009 09:05:17 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AAF5C48.1040201@perathoner.de> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> <4AAF5C48.1040201@perathoner.de> Message-ID: <65729AE2-DCF6-4844-BF5D-49A8809242D9@uni-trier.de> Hi There, Am 15.09.2009 um 11:20 schrieb Marcello Perathoner: > Keith J. Schultz wrote: > >> We need a format that is not based on an existing format, ... > > Why not? Very simply. Basically, most formats have a particular output in mind! Furthermore they are far too complex. The idea is to mark up the book text in a way that we can extract its structure and features. Then, depending on that, the output format is created. > > >> ... but we want a representation that contains as much information as >> possible. >> It should only take about a month to create such a format. > > ROTFL I said to create such a format. I did not say create the tools for creating output formats. Which is the actual crux, if you have been trying to follow this thread. Also, you need tools for getting the scans into this format, which should be done mostly by a computer in order to save time. regards Keith. From schultzk at uni-trier.de Thu Sep 17 00:55:48 2009 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Thu, 17 Sep 2009 09:55:48 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <6d99d1fd0909162247o504cc5bcy86ac80593ec2d809@mail.gmail.com> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> <6d99d1fd0909162247o504cc5bcy86ac80593ec2d809@mail.gmail.com> Message-ID: Hi There, I have looked at TEI, also at the way things SHOULD be encoded, and said NO WAY!! Far too complicated. As I have mentioned here time and time again, an output format should not be presupposed. The layout of a page is not that hard to mark up. regards Keith. Am 17.09.2009 um 07:47 schrieb David Starner: > On Wed, Sep 16, 2009 at 1:56 PM, Marcello Perathoner > wrote: > > Maybe. There's a very real question whether it's worth the man-power > to mark this up, and it's really a bit of a gratuitous feature. > >> If people had started using TEI instead of griping endlessly about >> minor >> shortcomings, we might have now a complete TEI workflow in place. > > If you had listened to the needs of the people who you wanted to start > using TEI instead of bitching about them and their requirements, maybe > they would have started using TEI. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Thu Sep 17 00:55:52 2009 From: schultzk at uni-trier.de (Keith J.
Schultz) Date: Thu, 17 Sep 2009 09:55:52 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AB0AD87.4050704@perathoner.de> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> <4AAF5C48.1040201@perathoner.de> <4AB0AD87.4050704@perathoner.de> Message-ID: <11EF7709-07DD-4278-9394-CDCCAD76D11D@uni-trier.de> Am 16.09.2009 um 11:19 schrieb Marcello Perathoner: > Al Haines (shaw) wrote: > >> It's clearly stated in PG Volunteers' FAQ V.89 (http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_any_places_where_I_should_indent_text.3F >> ) that if you want to prevent unwanted wrapping, lines that should >> not be wrapped should be indented a space or two. > > This 'markup' does not distinguish between poetry and a block quote. > A block quote should be indented *and* rewrapped. It depends on what is considered desirable. > > > And the Rewrap Blues is only part of the problem ... > > Another formidable challenge is to recover the chapter headings and > other headings to make them stand out and to build a TOC. I have to disagree here. Any fourth grader can do it. There are certain rules which one can follow. It will not handle all possible cases, yet most. But, then again, that is what proofers can handle easily. regards Keith. From schultzk at uni-trier.de Thu Sep 17 00:55:55 2009 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Thu, 17 Sep 2009 09:55:55 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> Message-ID: Hi There, Am 16.09.2009 um 08:12 schrieb James Adcock: > As an example of how much author semantic information is lost going > from an > author's writing to PG txt format, I went and compared differences > between a > recent HTML and PG TXT I did -- where after doing the TXT encoding I > went > back and did three more passes over the images to add back in semantic > differences to the HTML that the PG TXT didn't represent. The problem is that there are very few systems that truly represent semantic content. In order to truly represent such information you have to know about it. This requires one to have additional information, which is known as "world knowledge". This information is provided by the reader of books. > > Now the reality would be that it would take say TEI not HTML to > represent > all of the author's intent. But measuring the loss going from HTML > back to > TXT gives an order of magnitude estimate of how much author > information we > are throwing away by representing a work in PG TXT. In the case of > this > book, the answer was more than 1000 "losses" -- or an average of > about 3 > losses per page. And this is NOT counting about an additional 1000 > losses in > representation of emphasis. This problem is a matter of complexity. That is, even in pure Vanilla Text one can represent these intentions, but one loses readability. Furthermore one has to make assumptions of the true intent of the author!! regards Keith -------------- next part -------------- An HTML attachment was scrubbed...
URL: From marcello at perathoner.de Thu Sep 17 04:25:12 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu, 17 Sep 2009 13:25:12 +0200 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <6d99d1fd0909162247o504cc5bcy86ac80593ec2d809@mail.gmail.com> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> <6d99d1fd0909162247o504cc5bcy86ac80593ec2d809@mail.gmail.com> Message-ID: <4AB21C98.4020901@perathoner.de> David Starner wrote: >> Eg. geografic tagging. While visiting someplace you may want to find all >> book references that refer to the place you are in. > > Maybe. There's a very real question whether it's worth the man-power > to mark this up, and it's really a bit of a gratuitous feature. Gratuitous to people who have no vision. People who think they are `preserving?, while they are only consigning to rot on a different medium. You take a book from a dusty bookshelf, digitize it, and put it on a file server. You have taken content expressed in technology of 500 years ago and `updated? it to technology of 20 years ago. Today its all mobile devices. Ebooks have to come along in your shirt pocket or die. Wikipedia is doing it: http://en.wikipedia.org/wiki/File:Wikitude3.jpg There are many travel books in PG that could be marked up like that. -- Marcello Perathoner webmaster at gutenberg.org From joyce.b.wilson at sbcglobal.net Thu Sep 17 04:50:36 2009 From: joyce.b.wilson at sbcglobal.net (Joyce Wilson) Date: Thu, 17 Sep 2009 06:50:36 -0500 Subject: [gutvol-d] "PG volunteer lounge" list at Yahoo Groups Message-ID: <4AB2228C.4010304@sbcglobal.net> In the spirit of the world-famous DP spa, there is now a "Project Gutenberg volunteer lounge" list at Yahoo Groups: http://groups.yahoo.com/group/PG_vol_lounge/ Description: A friendly, supportive, and civil forum for Project Gutenberg volunteers. Will be moderated as needed to keep it that way. Group Email Addresses: Post message: PG_vol_lounge at yahoogroups.com Subscribe: PG_vol_lounge-subscribe at yahoogroups.com Unsubscribe: PG_vol_lounge-unsubscribe at yahoogroups.com List owner: PG_vol_lounge-owner at yahoogroups.com From marcello at perathoner.de Thu Sep 17 05:30:25 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu, 17 Sep 2009 14:30:25 +0200 Subject: [gutvol-d] Re: "PG volunteer lounge" list at Yahoo Groups In-Reply-To: <4AB2228C.4010304@sbcglobal.net> References: <4AB2228C.4010304@sbcglobal.net> Message-ID: <4AB22BE1.9060600@perathoner.de> Joyce Wilson wrote: > In the spirit of the world-famous DP spa, there is now a "Project > Gutenberg volunteer lounge" list at Yahoo Groups: > > http://groups.yahoo.com/group/PG_vol_lounge/ > > Description: A friendly, supportive, and civil forum for Project > Gutenberg volunteers. Will be moderated as needed to keep it that way. ... and you can bring your baby too: http://groups.yahoo.com/group/dpmoms/ -- Marcello Perathoner webmaster at gutenberg.org From jimad at msn.com Thu Sep 17 10:25:20 2009 From: jimad at msn.com (James Adcock) Date: Thu, 17 Sep 2009 10:25:20 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: I retract the words I put into Bowerbirds mouth ? I must have misinterpreted his intent in some of the things he emailed ? welcome to email. 
How about as a practical suggestion to at least make *some* progress forward out of this morass, how about if we agree that HTML be *one* of the sufficient file formats by itself for inclusion into PG, without the need to also submit PG TXT format? Since Bowerbird claims it is easy to go from PG TXT to HTML, then certainly it is as easy to go from HTML to PG TXT ? and Bowerbird, or someone -- can provide a ?Make PG Happy? automagical tool that will ?properly? encode PG TXT indentation for verse in order to encode ?please don?t wrap me? and can make PG-Happy decisions about how exactly to wrap at ?72 chars? without making the underlying text too ugly (not a trivial issue in my experience) and can decide what ligatures in what embedded languages should be broken into two chars, and what underlying PG TXT char encoding should be used to make the best tradeoffs between maintaining the glyphs the original author used vs. ?how low can you go? backwards compatibility with the various teletype emulator programs in use worldwide. Etc. Because, frankly, as a volunteer, these issues nauseate me. It is not my cup of tea. I would much rather put my time and effort into trying to do ONE reasonably good encoding of a real-world honest to god book published by some real-world publisher preferably during the lifetime of the author so that hopefully Michael will not continuously make the argument that publishers never respect the intent of the author anyway. [In my experience the first and second editions publishers of John Muir ?First Summer? DID do to a very good job of representing in printed form the style and flavor of the hand-written camp notebooks Muir made during that summer and on the contrary it is the mechanizations of DP in trying to follow the coding conventions of PG that discards this intent? so I believe it is Michael who is making excuses for *PG* being the publisher who doesn?t respect the original intent of the author by requiring PG coding conventions be respected uber alle!] And I do not pretend to be ?perfect? in my choices of encoding these books ? which is *precisely* why I would like to have an acceptable input acceptance format that is not ?write once? but could be picked up and improved by another volunteer in the future ? perhaps one who say has a photocopy of John Muir?s handwritten camp notes at hand, and can perform a Ph.D. thesis-level encoding of what Muir ?really meant to say? perhaps using the full power of say TEI. >jim said: >> As Bowerbird is only too happy to point out: >> ?Please feel free to go somewhere else!? >jim, jim, jim. i'm just about the only person here >who has any sympathy with what you are saying, -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Thu Sep 17 11:09:26 2009 From: jimad at msn.com (James Adcock) Date: Thu, 17 Sep 2009 11:09:26 -0700 Subject: [gutvol-d] Re: z.m.l. can do what you want In-Reply-To: References: Message-ID: >ok, do you understand that is comparing >a converted .html file to a converted .pdf? Yes, because in either case we are examining the possible *output* rendering file format *rendered results* currently accomplished by an advocate of a particular *input* encoding file format. Assuming that a particular output formatting rendering software or hardware is available on a particular hardware machine. My currently favorite hardware machine has built-in very good support for PDF, weak support for PG TXT, and little or no support for HTML (unless I read that HTML ?on line? 
using the machine?s weak web browser) So in practice, for a given hardware machine that I choose to use then the choice of output file rendering format becomes a non-issue ? as long as the hardware machine supports it. If the hardware machine doesn?t support it, then I have to find software that renders one output rendering file format to a different rendering file format [running that cross-rendering software on a difference machine which does support the cross-rendering software] ? which ALMOST ALWAYS in practice causes considerable semantic loss of author?s original intent, plus excessive ugliness. Which is why we would like a strong input encoding file format, one which is NOT overly concerned with how the ink get rendered on the display, so that we can avoid the problem of having to cross-render output rendering file formats. As a reader, I no more care if HTML or PDF is the output rendering file format than in the choice of rendering graphics language the computer display card or embedded graphics chip eventually runs. As a reader I just care how readable vs. how ugly the resulting ink on the display ends up. >show me the "proper" markup to accomplish the mobi, >and i will edit the template so you can get that markup. I think what the ?proper? markup is, is what we are trying to discuss. MOBI, and EPUB, have a concept of a Spine, which requires information not typically included in current markups, such as proper identification of author first name, last name. One possible markup from OPF showing one (pretty good imho) way this can be done is: Rev. Dr. Martin Luther King Jr. This spine information is used, for example, to provide the reader the option of listing his/her books alphabetically by author last name. And I am sure that someone is now sure to claim that automagical tools can be created to correctly extract this information, but I would hope that these famous author name examples would be enough to persuade you otherwise: Sun Tzu Miguel de Cervantes Marquis de Sade And I am sure someone else will claim that this information can automagically be provided by the PG database itself, but again perusal of how author names are currently being encoded in the PG database *ought* to be enough to dissuade people that spine information can be correctly provided automagically from that location. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Thu Sep 17 11:54:00 2009 From: jimad at msn.com (Jim Adcock) Date: Thu, 17 Sep 2009 11:54:00 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: >the "solution" is to understand that a viewer-program like eucalyptus can be programmed on _any_ platform, designed to take the project gutenberg plain-text files and display them in a beautiful (and powerful) manner. 
I challenge you to write such a program to run on _my_ choice of platform: Kindle DX From jimad at msn.com Thu Sep 17 12:53:34 2009 From: jimad at msn.com (Jim Adcock) Date: Thu, 17 Sep 2009 12:53:34 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <4AB126D2.2050706@perathoner.de> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> Message-ID: I personally like Marcello's efforts pretty well, but let me accept his challenge and use his examples as examples of the problems that I *personally* find as a reader of PG texts -- that I *in reality* find with PG's current efforts -- as well as examples of the need for better input markup languages than we currently are using: > Go here: > http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-pdf.pdf >and tell me what you don't like about the title page. What I don't like about the title page is that it doesn't show up correctly on my choice of machine, because my choice of machine assumes the existence of spine information. Thus the "Title" shows up on my machine as "4650-pdf" and "Author" shows up as "4650-pdf" So when I come back to my machine two weeks from now and search for this book by title, I cannot find it. And when I search for it by author, I still cannot find it. Other than that, this PDF text, to my surprise, shows up beautifully on my machine. I would, in practice, be willing to read this text. The choice of sans-serif font looks weird, and I would like to be able to change this choice of font, but of course I can't because this is PDF. Other than that, I would be happy to read this as a book representing a good effort from PG. Further, I would be able to download this file via the airwaves while waiting stuck at an airport, for example, and read this book there. In my opinion these results well-represent PG as an electronic publishing house. >And then go here: > http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-h.html >to verify that it looks the same in HTML. I can verify that it neither looks the same nor even shows up on my choice of machine at all, because my machine doesn't support HTML as a native file format. I can, if I am lucky, access this file via the airwaves using the machine's built-in web browser while waiting stuck at the airport, but I cannot store the results as a file, because my machine doesn't support HTML as a built-in file type. So I can read it on the ground, but I probably won't be able to read it in the air, and if I use my browser to access some other web site then I will probably lose this book. [Well, I take that back -- when I actually TRY to read this file via the airwaves as described above, it crashes my machine, requiring a hard reboot] Assuming I am not at an airport, but rather at home with my desktop computers, I can spend about 5 minutes of my time running an output-file-format to output-file-format cross-rendering software to change this HTML to MOBI format, which IS a native file format of my reader machine. The results then show up on my machine pretty beautifully. Except since HTML lacks spine information the Title now shows up as "4650-h" and the Author now shows up as "4650-h" Which means again, if I come back to my machine in two weeks, I will not be able to find this book. 
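The "spine" information being described here has a standard home in an EPUB's OPF package file: the Dublin Core title and creator fields, where the opf:file-as attribute carries the sort form ("Lastname, Firstname") separately from the display form printed in the book. A minimal, purely illustrative sketch (the title and name values are examples only, not taken from any PG file; the attribute names are from the OPF 2.0 spec):

  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:opf="http://www.idpf.org/2007/opf">
    <!-- what a device shows in its library listing -->
    <dc:title>Don Quixote</dc:title>
    <!-- display form as printed; opf:file-as gives the sort key -->
    <dc:creator opf:role="aut"
                opf:file-as="Cervantes Saavedra, Miguel de">Miguel de Cervantes Saavedra</dc:creator>
  </metadata>

Conversion tools generally carry these fields over into MOBI as well, which is roughly what would make a downloaded book findable again by author or title in a device's library sort.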
However, other than that, I like these results -- now that I have cross-rendered HTML to MOBI. The results are attractive, I CAN change font size. The font displayed is an attractive and appropriate sarif font. The pages reflow correctly. The links work for navigation. I can switch the machine to landscape mode and everything reflows correctly, supporting the capabilities of my machine. This file format would in practice be my favorite choice of file formats for my machine -- even though I can only access it initially from my house via a desktop machine and I have to waste five minutes of my time translating output file formats. In my opinion these results well represent PG as an electronic publishing house. >And then go here: > http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-0.txt >to see how it looks in TXT. To my surprise, I CAN take this UTF-8 TXT formatted file, transfer it to my favorite machine, and it DOES open up correctly interpreting the UTF-8 encoding [You learn something new every day!] This file also lacks spine information, so now Author information shows up as "4650-0" and Title shows up as "4650-0" which means once again, if I come back to this machine in two weeks, I will not be able to find this book. Since this file was rendered char72 under the assumption of a fixed pitch font, and since my machine doesn't use fixed pitch fonts, the end result looks silly and amateurish. The "Printers Ornament" renders as laughable junk. The fixed char72 line breaks make the text in practice unreadable unless I choose an impossibly tiny font -- which then still makes the text in practice unreadable. Gratuitous underscores are sprinkled liberally "everywhere" in the text making the text an unreadable hash. I would not read this text if paid $100 to do so. If I paid good money for this text I would ask for double-my-money back. This is my least favorite file format. Further, it also lacks spine information, meaning that again the Author now displays as "4650" and the Title displays as "4650" which means, again, that if I came back to this machine again in two weeks I will not be able to find this book -- which in this case would be a *blessing* ! In my opinion, if I were a first-time "customer" of PG who makes the mistake of choosing this file format to download to read on my brand of machine, I would conclude that PG consists of a bunch of clueless clowns and I would never return to the PG site again. My Opinions Only -- but I would hope this illustrates how IN PRACTICE a real-world customer's opinion of PG will be filtered through the perception of their choice of reading machine -- and in turn how well WHICH choice of PG file formats they happen to choose to download matches the capabilities of their machine. And without the spine information, none of this really works well with my machine. From i30817 at gmail.com Thu Sep 17 13:13:24 2009 From: i30817 at gmail.com (Paulo Levi) Date: Thu, 17 Sep 2009 21:13:24 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> Message-ID: <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> Just a little input about text files and charsets (encodings), since i had to use it for my program. 
Most browsers and applications open these files correctly simply because someone (mostly mozilla) did the hard work of making a fast guessing engine. I wouldn't be amazed if it failed in some books. From i30817 at gmail.com Thu Sep 17 13:15:13 2009 From: i30817 at gmail.com (Paulo Levi) Date: Thu, 17 Sep 2009 21:15:13 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> Message-ID: <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> Also, i used the catalog information to get the title. Metadata in the file name only is not a good way to encode this information, and metadata inside the file would require a special parser, everywhere. From jimad at msn.com Thu Sep 17 13:29:18 2009 From: jimad at msn.com (Jim Adcock) Date: Thu, 17 Sep 2009 13:29:18 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <44414895897048CF95F99FE12CC881FD@alp2400> <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400> Message-ID: >....Furthermore one has to make assumptions of the true intent of the author!! I'm not sure what the problem is if one has an italics tag to indicate that the author's intent was rendered in the original book in italic, and a small-caps tag to indicate that the author's intent in the original book was rendered in small-caps, etc.? On the contrary, the assumptions have to be made when the input markup language and the output rendering file formats are required to be one-and-the-same AND the rendering file format's power is less than that used by real-world printers already 400 years ago. Then the markup transcriber is forced to interpret the author's intent and how to compromise that intent in order to make it fit within the constraints of the rendering language -- which is being artificially constrained to be identical to the input markup language. If one had an input markup language that closely follows the author's intent as rendered by the original printer, then the problem becomes how do you reduce the strength of this markup to match the weaknesses of the output rendering file format, and that in general is an issue of style that can be represented in CSS for example. Or hacked up by hand if and when absolutely necessary. But it still means that the previous round of volunteers' efforts are correctly and completely maintained in the input markup language text so that the next round of volunteers can take another shot at the text some time in the future.
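A minimal sketch of the split described just above -- semantic markup in the master text, with the presentation decision pushed out into a stylesheet. The class names are made up here for illustration only, not a PG or DP convention:

  <p>so wrote <span class="smcap">John Muir</span> in his
  <span class="emph">first</span> summer in the Sierra.</p>

  .smcap { font-variant: small-caps; }
  .emph  { font-style: italic; }

A weaker output format can then simply drop or flatten the spans (for plain text, say, uppercasing the small-caps run), while the marked-up master keeps the distinction intact for the next volunteer to improve on.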
From jimad at msn.com Thu Sep 17 14:02:47 2009 From: jimad at msn.com (Jim Adcock) Date: Thu, 17 Sep 2009 14:02:47 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> Message-ID: PG catalog may be a reasonable way to get Title information. As presently implemented the PG catalog is not a reasonable source of Author Firstname, Lastname information -- for multiple reasons! From i30817 at gmail.com Thu Sep 17 16:10:43 2009 From: i30817 at gmail.com (Paulo Levi) Date: Fri, 18 Sep 2009 00:10:43 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> Message-ID: <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> :) I had to make this loop for it (but then again i am indexing it, and not seperating the names like you are, that takes a little bit more work.) The code tries to find a date in the last part of the string, and then reorder all last name, first name multiple authors (i wanted normal order) Note the "possibles", and "hopefully" there, but i think it works for all books i encountered. In fact the names i reported a while ago as defective were found in errors in this method. It isn't that there is no method, it's just extremely ... non-normalized. private final StringBuilder normalizeString = new StringBuilder(); protected String normalizeName(String authorString) { int separator = authorString.lastIndexOf(','); //normal date seperator if (separator != -1) { String possibleDate = authorString.substring(separator + 1); for (int i = 0; i < possibleDate.length(); i++) { if (Character.isDigit(possibleDate.charAt(i))) { //a date, hopefully... return exchangeNames(authorString.substring(0, separator)); } } //no date, but change the name anyway. return exchangeNames(authorString); } return authorString; } protected String exchangeNames(String authorString) { normalizeString.setLength(0); exchangeNamesAux(authorString); return normalizeString.toString(); } private void exchangeNamesAux(String authorString) { int seperator = authorString.indexOf(','); if (seperator == -1) { normalizeString.append(authorString); return; } exchangeNamesAux(authorString.substring(seperator + 2)); normalizeString.append(' ').append(authorString.substring(0, seperator)); } On Thu, Sep 17, 2009 at 10:02 PM, Jim Adcock wrote: > PG catalog may be a reasonable way to get Title information. ?As presently > implemented the PG catalog is not a reasonable source of Author Firstname, > Lastname information -- for multiple reasons! 
> > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From i30817 at gmail.com Thu Sep 17 16:23:02 2009 From: i30817 at gmail.com (Paulo Levi) Date: Fri, 18 Sep 2009 00:23:02 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> References: <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> Message-ID: <212322090909171623y359a48b3xf5cdbdc59cc230f1@mail.gmail.com> Correction, that is not multiple authors, but one per string + date. From i30817 at gmail.com Thu Sep 17 16:25:00 2009 From: i30817 at gmail.com (Paulo Levi) Date: Fri, 18 Sep 2009 00:25:00 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909171623y359a48b3xf5cdbdc59cc230f1@mail.gmail.com> References: <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> <212322090909171623y359a48b3xf5cdbdc59cc230f1@mail.gmail.com> Message-ID: <212322090909171625v2e96f457n3efb52126244e6ff@mail.gmail.com> BTW i found this in my catalog post processor / indexer. //can have \n stupidly... String titleString = title.stringValue().replaceAll("\n", " "); :) From jimad at msn.com Thu Sep 17 16:30:02 2009 From: jimad at msn.com (James Adcock) Date: Thu, 17 Sep 2009 16:30:02 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> References: <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> Message-ID: LOL -- and what pray tell do you get for the Author Lastnames in the examples I gave using your algorithm? 
>private final StringBuilder normalizeString = new StringBuilder(); From i30817 at gmail.com Thu Sep 17 17:29:02 2009 From: i30817 at gmail.com (Paulo Levi) Date: Fri, 18 Sep 2009 01:29:02 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> Message-ID: <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> Normaly a name tuple is like this : Last names, first name , Date It can also be like this Last name, first name or like this Name (for plato etc) I strip out the optional date (I should change the deciding algorithm to 2 non consecutive digits probably. Basically dates always seem to have a digit there) then exchange the first and last names if needed and join them again. If you want to keep them separate you can make a domain object or a list for that. BTW i just realized i don't need the recursion for nothing. I might change it. It takes a good 1.5 m to index the Gutenberg index even with a lot of hacks. From sly at victoria.tc.ca Thu Sep 17 17:33:13 2009 From: sly at victoria.tc.ca (Andrew Sly) Date: Thu, 17 Sep 2009 17:33:13 -0700 (PDT) Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> References: <4AAD7C04.5030806@perathoner.de> <4AAE819C.1040108@perathoner.de> <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> <4AAF5A12.8040505@perathoner.de> <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> Message-ID: Are you talking about reading the files directly from gutenberg.org? Files are served up with the encoding specified in the http header. I don't know the technical details--Marcello set it all up. --Andrew On Thu, 17 Sep 2009, Paulo Levi wrote: > Just a little input about text files and charsets (encodings), since i > had to use it for my program. Most browsers and applications open > these files in the correctly simply because someone (mostly mozilla) > did the hard work of making a fast guessing engine. I wouldn't be > amazed if it failed in some books. From i30817 at gmail.com Thu Sep 17 17:54:40 2009 From: i30817 at gmail.com (Paulo Levi) Date: Fri, 18 Sep 2009 01:54:40 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> References: <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> <4AB126D2.2050706@perathoner.de> <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> Message-ID: <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> Actually a little bit of disinformation there: I do need the recursion. Titles (for instance) have an additional ",". Forgot. 
On Fri, Sep 18, 2009 at 1:29 AM, Paulo Levi wrote: > Normaly a name tuple is like this : > Last names, first name , Date > It can also be like this > Last name, first name > or like this > Name > (for plato etc) > > I strip out the optional date (I should change the deciding algorithm > to 2 non consecutive digits probably. Basically dates always seem to > have a digit there) > then exchange the first and last names if needed and join them again. > > If you want to keep them separate you can make a domain object or a > list for that. BTW i just realized i don't need the recursion for > nothing. I might change it. It takes a good 1.5 m to index the > Gutenberg index even with a lot of hacks. > From i30817 at gmail.com Thu Sep 17 18:18:44 2009 From: i30817 at gmail.com (Paulo Levi) Date: Fri, 18 Sep 2009 02:18:44 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> References: <4AB126D2.2050706@perathoner.de> <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> Message-ID: <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> Oh i see the obvious error now (finally). How about a little different algorithm: strip out the date, then take the , suffix, prefix, sufix prefix until empty. From i30817 at gmail.com Thu Sep 17 18:33:09 2009 From: i30817 at gmail.com (Paulo Levi) Date: Fri, 18 Sep 2009 02:33:09 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> References: <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> Message-ID: <212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> Possibly this? BTW thanks for spotting that. private String normalizeName(String authorString) { int separator = authorString.lastIndexOf(','); //normal date seperator if (separator != -1) { String possibleDate = authorString.substring(separator + 1); for (int i = 0; i < possibleDate.length(); i++) { if (Character.isDigit(possibleDate.charAt(i))) { //a date, hopefully... return exchangeNames(authorString.substring(0, separator)); } } //no date, but change the name anyway. 
return exchangeNames(authorString); } return authorString; } private String exchangeNames(String authorString) { normalizeString.setLength(0); exchangeNamesAuxSuffix(authorString); return normalizeString.toString(); } private void exchangeNamesAuxSuffix(String authorString) { int seperator = authorString.lastIndexOf(','); if (seperator == -1) { normalizeString.append(authorString); return; } normalizeString.append(authorString.substring(seperator + 2)).append(' '); exchangeNamesAuxPrefix(authorString.substring(0, seperator)); } private void exchangeNamesAuxPrefix(String authorString) { int seperator = authorString.indexOf(','); if (seperator == -1) { normalizeString.append(authorString); return; } normalizeString.append(authorString.substring(0, seperator)).append(' '); exchangeNamesAuxSuffix(authorString.substring(seperator + 2)); } From jimad at msn.com Thu Sep 17 19:15:41 2009 From: jimad at msn.com (James Adcock) Date: Thu, 17 Sep 2009 19:15:41 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> References: <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> <212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> Message-ID: Sorry Paulo, I'm not sure what you are up to, but again, what do your algorithms actually find when applied to the author name examples I presented earlier? Sun Tzu Miguel de Cervantes Marquis de Sade From i30817 at gmail.com Thu Sep 17 19:24:45 2009 From: i30817 at gmail.com (Paulo Levi) Date: Fri, 18 Sep 2009 03:24:45 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> <212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> Message-ID: <212322090909171924u821f756qc64ef823c224663a@mail.gmail.com> Sun Tzu apparently doesn't exist (it's probably as the original name. Searching for Art of War gives Sunzi as one of the names) Miguel de Cervantes - > Miguel de Cervantes Saavedra Marquis de Sade -> marquis de Sade (marquis is lowercase for some reason on the index). From i30817 at gmail.com Thu Sep 17 19:33:23 2009 From: i30817 at gmail.com (Paulo Levi) Date: Fri, 18 Sep 2009 03:33:23 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909171924u821f756qc64ef823c224663a@mail.gmail.com> References: <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> <212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> <212322090909171924u821f756qc64ef823c224663a@mail.gmail.com> Message-ID: <212322090909171933u660219f2u57df90004223717e@mail.gmail.com> This is not applied to the names you gave themselves, but as they appear on the index. 
Marquis de Sade for instance appears on the index as : "Sade, marquis de, 1740-1814". From i30817 at gmail.com Thu Sep 17 20:47:13 2009 From: i30817 at gmail.com (Paulo Levi) Date: Fri, 18 Sep 2009 04:47:13 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909171933u660219f2u57df90004223717e@mail.gmail.com> References: <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> <212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> <212322090909171924u821f756qc64ef823c224663a@mail.gmail.com> <212322090909171933u660219f2u57df90004223717e@mail.gmail.com> Message-ID: <212322090909172047m65a9edaatdf683dc7884d45c1@mail.gmail.com> Duh, still wrong. Wait a second, i will sort it out. From i30817 at gmail.com Thu Sep 17 20:57:47 2009 From: i30817 at gmail.com (Paulo Levi) Date: Fri, 18 Sep 2009 04:57:47 +0100 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909172047m65a9edaatdf683dc7884d45c1@mail.gmail.com> References: <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> <212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> <212322090909171924u821f756qc64ef823c224663a@mail.gmail.com> <212322090909171933u660219f2u57df90004223717e@mail.gmail.com> <212322090909172047m65a9edaatdf683dc7884d45c1@mail.gmail.com> Message-ID: <212322090909172057q5a2dc32cu76d49efd24680a16@mail.gmail.com> You're right, it is inconsistent in some (apparently only titled) authors. For example: 1 name, title broken up. La Rochejaquelein, Marie-Louise-Victoire, marquise de, 1772-1857 versus: 2 names title intact (correct apparently since it is consistent with most of the rest of the names). Disraeli, Benjamin, Earl of Beaconsfield, 1804-1881 No way to recognize if it should be plain LIFO order or something else. From sly at victoria.tc.ca Thu Sep 17 21:54:08 2009 From: sly at victoria.tc.ca (Andrew Sly) Date: Thu, 17 Sep 2009 21:54:08 -0700 (PDT) Subject: [gutvol-d] Author names in catalog In-Reply-To: <212322090909172057q5a2dc32cu76d49efd24680a16@mail.gmail.com> References: <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> <212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> <212322090909171924u821f756qc64ef823c224663a@mail.gmail.com> <212322090909171933u660219f2u57df90004223717e@mail.gmail.com> <212322090909172047m65a9edaatdf683dc7884d45c1@mail.gmail.com> <212322090909172057q5a2dc32cu76d49efd24680a16@mail.gmail.com> Message-ID: There is a separate cataloger's mailing list if you are interested in further discussion with the people who are editing the catalog. However, it might help if I tell you that most of the author headings follow the form used at the Library of Congress. And _they_ follow rules and vaguries that have built up over many decades. I can tell you without uncertainty that you will not be able to prepare a process which will give you 100% good results. --Andrew On Fri, 18 Sep 2009, Paulo Levi wrote: > You're right, it is inconsistent in some (apparently only titled) authors. > For example: > 1 name, title broken up. 
> La Rochejaquelein, Marie-Louise-Victoire, marquise de, 1772-1857 > From jimad at msn.com Thu Sep 17 22:11:43 2009 From: jimad at msn.com (James Adcock) Date: Thu, 17 Sep 2009 22:11:43 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: <212322090909172057q5a2dc32cu76d49efd24680a16@mail.gmail.com> References: <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> <212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> <212322090909171924u821f756qc64ef823c224663a@mail.gmail.com> <212322090909171933u660219f2u57df90004223717e@mail.gmail.com> <212322090909172047m65a9edaatdf683dc7884d45c1@mail.gmail.com> <212322090909172057q5a2dc32cu76d49efd24680a16@mail.gmail.com> Message-ID: I hope you have figured out my point by now: Namely, IF one wants to make "correct" e-book files in a number of formats, including EPUB and MOBI, it is not possible algorithmically to determine the "correct" encoding of Author Lastname, Firstname from data currently found in either the PG HTML encodings or the PG TXT encodings. It is also not possible to make "correct" encodings of Author Lastname, Firstname from the information currently recorded in the PG catalog. One would like to have "correct" encodings of Author Lastname, Firstname so that, if a customer adds a PG text in say EPUB or MOBI to their existing collection of e-book titles in their e-book library, the Author Lastname, Firstname sorts and displays correctly next to any other e-books they might already possess from other sources. Sun Tzu: Sun is the author's family name, or what is represented as an author's "Lastname" in western cultures. Tzu is a romanization of an honorific such as "Sir" or "Mr": Sun Tzu 孫子; Sūn Zǐ. This is listed in a westernized, corrupted form in the PG catalog as "Sunzi", which shows lack of cultural respect -- combining the family name with the honorific in a way that artificially forms an apparent feminine. However, I believe the transcriber needs to transcribe the book as written, including the spelling or representation of the author name found there, which means that the book transcription in HTML or PG TXT cannot be used as a reliable source of author name -- nor should the spelling given in transcription necessarily be how the author is listed in the PG catalog. Nor is it thus algorithmically possible to figure out which part therein is the "last name [family name]". So therefore, in addition to the coding in the HTML or the PG TXT, there also needs to be a "spine" representation that gives a correct canonical identification of author "Lastname: Sun Firstname: Tzu", where again Tzu isn't really the first name, but by tradition this slot gets used for that part of the canonical author name representation which isn't the lastname. "Art of War" is also known simply as "The Sun Tzu." Miguel de Cervantes: Last name of author is actually most often canonically represented as "Cervantes Saavedra", with the "firstname" part typically represented as "Miguel de". Saavedra being the mother's last name in a culture where children bear their mother's name, but when the book is sold in other cultures that are uncomfortable with this convention then the Saavedra tends to get dropped -- but shouldn't be, because it IS the author's last name. Marquis de Sade: Last name of author = Sade. First name part is "Donatien Alphonse François".
But by tradition customers are probably expecting the firstname part to be represented as "Marquis de" -- they almost certainly will not recognize "Donatien Alphonse Franc?ois". So its not real clear how the firstname part ought be coded, but if the lastname part is coded as Sade then at least the book will show up about the right place in the possessor's library listing. Again, the point being neither the PG catalog nor the literal transcription can be used as a reliable source of the author lastname, firstname information -- which DOES need to be reliably included in the e-book file so that the e-book will show up at correct location in the customer's e-book library sort. From Bowerbird at aol.com Fri Sep 18 01:49:57 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 18 Sep 2009 04:49:57 EDT Subject: [gutvol-d] good news on the doorstep Message-ID: good news on the doorstep. there's now a new listserve for "civil" p.g. volunteers. that's right, no need to put up with the rudeness of this list. > http://groups.yahoo.com/group/PG_vol_lounge/message/1 here's the introduction, from founder joyce wilson: > Welcome to the Project Gutenberg volunteer lounge! > This list is intended to provide a friendly, supportive, > and civil forum for PG volunteers. I hope it will provide > a sense of community connection for those of us > whose PG volunteer jobs can seem kind of isolated. > Problem-solve, discuss, ask questions, share good news, > brag on yourself and others, liberally apply congratulations > and back-pats, chit-chat about stuff. But don't be a jerk, > or you'll be removed from the list. so, now if you're feeling a need for some liberally-applied congratulations and back-pats, or some shared good news, without jerks pestering you, you'll know exactly where to go. thanks for provided this much-needed service, joyce! -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From richfield at telkomsa.net Fri Sep 18 04:51:05 2009 From: richfield at telkomsa.net (Jon Richfield) Date: Fri, 18 Sep 2009 13:51:05 +0200 Subject: [gutvol-d] Name lists and Big-endianism Message-ID: <4AB37429.5090403@telkomsa.net> Without due respect for the dead hand of history, or the dead heads of aesthetes trying to impose attractive schemes devoid of logic or practicality, it would be nice if we could agree on some scheme to sequence our author indexes. It won't happen of course, and I am not silly enough to think that this brief note contains anything conclusive, but give it a thunk, anyone interested. Anyone uninterested is sternly forbidden to consider the matter or read this remark (it hardly hopes to attain the dignity of a suggestion.) Let us assume that we have authors such as the famous Johanna Kakebeenwania van der Merwe O'Brien, Jolien Gertina van der Poel O'Mally, Paulette Marmorella Bridhedia Paul-Ewen Truupsvor Theooseov Swizarminife Neville McSnurtle Quentin Urtel Xavier Ypres Zulrich ?rtur Aspoestertjie Sinnerella Katrina van Aswagen Gehardus Johannes Katwimpers Janse van Vuuren van den Heever Johannes Gehardus du Toit van der Vyfer Jakobus Johannes Joumoerus Vandaaigoed Lelie Belladonna Nerina Vanderker Otto Werther von und zu Bismarkharing The problem is notionally to sequence them according to a comprehensible and totally unambiguous scheme, with the least sensitivity to uncertain spellling and concentrations of initial letters etc. 
The best approach is to write each name, as much as desired in normal internal sequence as above, then split each name immediately after the last non-alphabetic character (including spaces). The bit at the end is what you sequence by, NOT the full name, NOT necessarily the full surname, and without consideration of case or diacritical signs. In our by no means random, but hardly unrealistic example,several questions arise, including the role of various non-alphabetic characters, and the artificial concentration of surnames under the initial letters of prefixes such as de, der, du, van, van der, von den, and no end of etcs. By sorting by the terminal alphabetic string, we remove ambiguity and even out the spread of names through the alphabet. In simple information theory this optimises search time and sort efficiency. The above example becomes: Aswagen Aspoestertjie Sinnerella Katrina van Bismarkharing Otto Werther von und zu Brien Johanna Kakebeenwania van der Merwe O' Ewen Paulette Marmorella Bridhedia Paul- Heever Gehardus Johannes Katwimpers Janse van Vuuren van den Mally Jolien Gertina van der Poel O' Swizarminife Truupsvor Theooseov Urtel Neville McSnurtle Quentin ?rtur Xavier Ypres Zulrich Vandaaigoed Jakobus Johannes Joumoerus Vanderker Lelie Belladonna Nerina Vyfer Johannes Gehardus du Toit van der The head benefit is in the de tailing. Not that anyone asked. Cheers, Jon From walter.van.holst at xs4all.nl Fri Sep 18 07:51:12 2009 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Fri, 18 Sep 2009 16:51:12 +0200 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: <4AB37429.5090403@telkomsa.net> References: <4AB37429.5090403@telkomsa.net> Message-ID: <4AB39E60.6030108@xs4all.nl> Jon Richfield schreef: > In our by no means random, but hardly unrealistic example,several > questions arise, including the role of various non-alphabetic > characters, and the artificial concentration of surnames under the > initial letters of prefixes such as de, der, du, van, van der, von den, > and no end of etcs. By sorting by the terminal alphabetic string, we > remove ambiguity and even out the spread of names through the alphabet. > In simple information theory this optimises search time and sort > efficiency. The above example becomes: > > Aswagen Aspoestertjie Sinnerella Katrina van > > Bismarkharing Otto Werther von und zu > > Brien Johanna Kakebeenwania van der Merwe O' > > Ewen Paulette Marmorella Bridhedia Paul- > > Heever Gehardus Johannes Katwimpers Janse van Vuuren van den > > Mally Jolien Gertina van der Poel O' > > Swizarminife Truupsvor Theooseov > > Urtel Neville McSnurtle Quentin > > ?rtur Xavier Ypres Zulrich > > Vandaaigoed Jakobus Johannes Joumoerus > > Vanderker Lelie Belladonna Nerina > > Vyfer Johannes Gehardus du Toit van der > > > The head benefit is in the de tailing. > > Not that anyone asked. Since you've picked a bunch of mostly Dutch and German authors or at least authors whose ancestors happened to be Dutch or German, I'd like to point out that a rather common way in Dutch databases is to do it slightly different: Sinerella Katrina van Aswagen Aspostertjie would become: Aswagen Aspostertjie, van, Sinerella Katrina This prevents alphabetically sorting all surnames from becoming a massive series of entries starting with a 'V'. I'm rather sure Marcello can provide the answer on whether our Eastern brethren do it the same. 
Regards, Walter van Holst From richfield at telkomsa.net Fri Sep 18 08:29:14 2009 From: richfield at telkomsa.net (Jon Richfield) Date: Fri, 18 Sep 2009 17:29:14 +0200 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: <4AB39E60.6030108@xs4all.nl> References: <4AB37429.5090403@telkomsa.net> <4AB39E60.6030108@xs4all.nl> Message-ID: <4AB3A74A.6080103@telkomsa.net> Dag Walter, bly te kenne! In South Africa there are indeed strong Dutch as well as other Germanic influences, and nowhere more so than in our surnames (especially Afrikaans surnames of course). Van, van der, von, van den, ter, ten, etc. Van and van der are easily the leaders though. We do however have a strong Huguenot influences (de, du, even a few le etc) and don't forget the Irish O', though they are not as prominent as in say, the US. Also, for similar reasons some black names begin with U, N, or M. We also have Portuguese names (Del...) And yes, the reason you mention is exactly the one I had in mind. Especially in certain districts where certain families settled and established a patronymic dominance that became a local source of pervasive inconvenience and perverse pride. (There sometimes are problems with the family forenames as well; schools and universities have been driven to distinguish between particular students by date of birth!) And thereby hang various tales, variously amusing... I am not quite certain of the DB convention you mention though. Are you sure that you didn't have some finger trouble? "Aspoestertjie Sinnerella Katrina van Aswagen" becomes "Aswagen Aspostertjie, van, Sinerella Katrina"??? Isn't that a bit pointlessly arbitrary, devious, even obscure? If it is indeed the convention, then so be it, but I would think that the rotation scheme I proposed has major advantages. For one thing it puts the Driscols Benny O' in their places, along with the Drifters Benny Smith- and the Diemans Benny van. Mooi bly! Jon > Jon Richfield schreef: >> In our by no means random, but hardly unrealistic example,several >> questions arise, including the role of various non-alphabetic >> characters, and the artificial concentration of surnames under the >> initial letters of prefixes such as de, der, du, van, van der, von >> den, and no end of etcs. By sorting by the terminal alphabetic >> string, we remove ambiguity and even out the spread of names through >> the alphabet. In simple information theory this optimises search time >> and sort efficiency. The above example becomes: >> >> Aswagen Aspoestertjie Sinnerella Katrina van >> >> Bismarkharing Otto Werther von und zu >> >> Brien Johanna Kakebeenwania van der Merwe O' >> >> Ewen Paulette Marmorella Bridhedia Paul- >> >> Heever Gehardus Johannes Katwimpers Janse van Vuuren van den >> >> Mally Jolien Gertina van der Poel O' >> >> Swizarminife Truupsvor Theooseov >> >> Urtel Neville McSnurtle Quentin >> >> ?rtur Xavier Ypres Zulrich >> >> Vandaaigoed Jakobus Johannes Joumoerus >> >> Vanderker Lelie Belladonna Nerina >> >> Vyfer Johannes Gehardus du Toit van der >> >> >> The head benefit is in the de tailing. >> >> Not that anyone asked. 
> > Since you've picked a bunch of mostly Dutch and German authors or at > least authors whose ancestors happened to be Dutch or German, I'd like > to point out that a rather common way in Dutch databases is to do it > slightly different: > > Sinerella Katrina van Aswagen Aspostertjie would become: > > Aswagen Aspostertjie, van, Sinerella Katrina > > This prevents alphabetically sorting all surnames from becoming a > massive series of entries starting with a 'V'. > > I'm rather sure Marcello can provide the answer on whether our Eastern > brethren do it the same. > > Regards, > > Walter van Holst > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > > From prosfilaes at gmail.com Fri Sep 18 10:53:11 2009 From: prosfilaes at gmail.com (David Starner) Date: Fri, 18 Sep 2009 13:53:11 -0400 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: <6d99d1fd0909181053s5ca3ae67sa21e4923e6ed4f21@mail.gmail.com> On Thu, Sep 17, 2009 at 1:25 PM, James Adcock wrote: > Since Bowerbird claims it is easy to go from PG > TXT to HTML, then certainly it is as easy to go from HTML to PG TXT That does not follow; it is easy to go from PG TXT to paper, but not so easy to go the other way. -- Kie ekzistas vivo, ekzistas espero. From lee at novomail.net Fri Sep 18 16:30:48 2009 From: lee at novomail.net (Lee Passey) Date: Fri, 18 Sep 2009 17:30:48 -0600 Subject: [gutvol-d] Re: PG French text file #1500 In-Reply-To: References: Message-ID: <4AB41828.2000908@novomail.net> Michael S. Hart wrote: > Right now we are looking at Voltaire, de Toqueville's "Democracy," > and a few others. > > 20 more and we are at 1500. > > Please take a look for various copies of "Democracy" and anything > else you think we might be able to use, and let me know. > http://fr.wikisource.org/wiki/De_la_d%C3%A9mocratie_en_Am%C3%A9rique Wikisource claims to have over 50,000 French works, although I notice a fair number of them are works translated from other languges (e.g., H.G. Wells' classic _La Guerre des Mondes_). Happy Harvesting! From sly at victoria.tc.ca Fri Sep 18 23:28:42 2009 From: sly at victoria.tc.ca (Andrew Sly) Date: Fri, 18 Sep 2009 23:28:42 -0700 (PDT) Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> <212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> <212322090909171924u821f756qc64ef823c224663a@mail.gmail.com> <212322090909171933u660219f2u57df90004223717e@mail.gmail.com> <212322090909172047m65a9edaatdf683dc7884d45c1@mail.gmail.com> <212322090909172057q5a2dc32cu76d49efd24680a16@mail.gmail.com> Message-ID: I think that with the few examples you have given, you have shown that it is not possible to do so with _any_ library catalog, because the usage of names has so many variations and exceptions. --Andrew On Thu, 17 Sep 2009, James Adcock wrote: > I hope you have figured out my point by now: Namely, IF one wants to make "correct" e-book files in a number of formats, including EPUB and MOBI, it is not possible algorithmically to determine the "correct" encoding of Author Lastname, Firstname from data currently found in either the PG HTML encodings nor the PG TXT encodings. 
From sly at victoria.tc.ca Fri Sep 18 23:39:37 2009 From: sly at victoria.tc.ca (Andrew Sly) Date: Fri, 18 Sep 2009 23:39:37 -0700 (PDT) Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: <4AB37429.5090403@telkomsa.net> References: <4AB37429.5090403@telkomsa.net> Message-ID: And don't forget that other national traditions you can have more confusion. For example: For Hungarian names, the preferred order is [Family name] [Given name] So that in the main text of PG#19433 the author's name is given as: Balazs Bela, with the understanding that the first name that appears is the one we alphbetize by. And in Icelandic names, what looks to us as a "last name" is not actually a family name, but a patrynomic. It is incorrect to alphabetize by that, so the given name is used instead. --Andrew On Fri, 18 Sep 2009, Jon Richfield wrote: > Without due respect for the dead hand of history, or the dead heads of > aesthetes trying to impose attractive schemes devoid of logic or > practicality, it would be nice if we could agree on some scheme to > sequence our author indexes. It won't happen of course, and I am not > silly enough to think that this brief note contains anything conclusive, > but give it a thunk, anyone interested. > Anyone uninterested is sternly forbidden to consider the matter or read > this remark (it hardly hopes to attain the dignity of a suggestion.) > > Let us assume that we have authors such as the famous > > Johanna Kakebeenwania van der Merwe O'Brien, > Jolien Gertina van der Poel > O'Mally, > Paulette Marmorella Bridhedia Paul-Ewen > Truupsvor Theooseov > Swizarminife > Neville McSnurtle Quentin Urtel > Xavier Ypres Zulrich > ?rtur > Aspoestertjie Sinnerella Katrina van > Aswagen > Gehardus Johannes Katwimpers Janse van Vuuren van den Heever > Johannes Gehardus du Toit van > der Vyfer > Jakobus Johannes Joumoerus Vandaaigoed > Lelie Belladonna Nerina > Vanderker > Otto > Werther von > und zu Bismarkharing > > The problem is notionally to sequence them according to a > comprehensible and totally unambiguous scheme, with the least > sensitivity to uncertain spellling and concentrations of initial letters > etc. > The best approach is to write each name, as much as desired in normal > internal sequence as above, then split each name immediately after the > last non-alphabetic character (including spaces). The bit at the end is > what you sequence by, NOT the full name, NOT necessarily the full > surname, and without consideration of case or diacritical signs. > > In our by no means random, but hardly unrealistic example,several > questions arise, including the role of various non-alphabetic > characters, and the artificial concentration of surnames under the > initial letters of prefixes such as de, der, du, van, van der, von den, > and no end of etcs. By sorting by the terminal alphabetic string, we > remove ambiguity and even out the spread of names through the alphabet. > In simple information theory this optimises search time and sort > efficiency. 
The above example becomes: > > Aswagen Aspoestertjie Sinnerella Katrina van > > Bismarkharing Otto Werther von und zu > > Brien Johanna Kakebeenwania van der Merwe O' > > Ewen Paulette Marmorella Bridhedia Paul- > > Heever Gehardus Johannes Katwimpers Janse van Vuuren van den > > Mally Jolien Gertina van der Poel O' > > Swizarminife Truupsvor Theooseov > > Urtel Neville McSnurtle Quentin > > ?rtur Xavier Ypres Zulrich > > Vandaaigoed Jakobus Johannes Joumoerus > > Vanderker Lelie Belladonna Nerina > > Vyfer Johannes Gehardus du Toit van der > > > The head benefit is in the de tailing. > > Not that anyone asked. > > Cheers, > > Jon > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From Bowerbird at aol.com Sat Sep 19 00:47:44 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 19 Sep 2009 03:47:44 EDT Subject: [gutvol-d] re: In search of a more-vanilla vanilla TXT Message-ID: jim said: > I challenge you to write such a program > to run on _my_ choice of platform: Kindle DX i wonder, jim, if you are being disingenuous on purpose? or are you just incapable of having an honest discussion? because i'm sure you know you "challenged" me to do something that cannot be done. and as much as you might like to say "that's the point", it's really _not_... because nobody else can meet that challenge either. therefore, the obvious response to this "challenge" is to do what one _can_ do, which is to custom-tailor a beautiful e-book which _can_ be read on the dx... and since the dx reads .pdf files, that's very simple! so your big "challenge" is whisked away immediately, in yet another testament to the power of plain-text... z.m.l. can create very beautiful (and powerful) .pdf, fully customized to your own personal preferences, all with just the click of a button. from a z.m.l. file. (or a p.g. plain-text file you've modified into z.m.l.) by using a program on your own personal computer. you can choose any font you like from your computer. and the leading you want. and the margins you want. and any font-size. and any font-color. and so on... it's the ultimate in personal control. people like that. and if you prefer native kindle format instead of .pdf? just do the zml-to-html conversion instead, where the .html is generated to your own personal preferences, and then convert that .html file to the kindle format... you don't have to "put up with" the settings from some online website, and hope that they'll be "good enough", since you probably cannot change them if they aren't... you get it the way you want, using your own machine, with software that you know will keep working on it... what's not to like? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sat Sep 19 02:08:55 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 19 Sep 2009 05:08:55 EDT Subject: [gutvol-d] a case of deliberate sabotage by a p.g. volunteer Message-ID: jim, i just took a look at "wings of the dove" -- pg#29452, which you post-processed -- and i'm troubled by what i discovered there. ok, maybe "troubled" is a bit melodramatic, much like the subject-line on this post, but i don't really think it's all _that_ farfetched... what i found is that, in the .html version of the book, you showed the italics properly... good job. 
in the .txt version, however, you deliberately deleted the italics markers which formatters had invested considerable work in inserting... bad job! the text version -- this was the 8-bit file -- was also missing a handful of diacritics in it: > Seen at a foreign table d'h?te, he suggested > Br?nig (several cases of this one) > word--? bient?t!--across > You're blas?, but you're not enlightened. > wasn't it, ? peu pr?s, what all > Matcham were inesp?r?es, were pure manna in some books, those missing diacritics would be a big issue. here, they're fairly uncommon, and thus do not really constitute a very big deal. but the missing italics? they are a major problem. and it's a problem that _you_ introduced yourself. you're supposed to use underscores for the italics. (go ahead, read the instructions, it says it clearly.) you're definitely _not_ supposed to remove them! and i must say, it takes a lot of gall for you to do this deliberate sabotage of the plain-text file and _then_ come here to complain because that file is inferior... of course it's inferior! you made it so! you went out of your way to make it substandard. if i would've done that, i'd be ashamed of myself. i'm also disturbed that the whitewashers allowed this intentionally-disfigured file into the library... but that's another matter, a fight for another day. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sat Sep 19 02:34:59 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 19 Sep 2009 05:34:59 EDT Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer Message-ID: ok, it looks like the diacritic problem was an encoding glitch on my end so i apologize for that, and take it back. but the missing italics still loom large. i might also throw in a few other notes. first off, there's no table of contents in these files, either the .html or the .txt... i consider that to be plain unacceptable. it's easy enough to make, and it's useful. besides, it orients the reader to the book. i even like backlinks from chapter heads to the table of contents, for quick navigation, and previous/next chapter links are nice too. and, since there has been some discussion about title-pages, i think the title-page in the .html version is done poorly, because it's too widely spaced, which means that it needs about two screens on my monitor. (and i have a 23-inch cinema-screen here.) the whole thing needs much tighter leading. it certainly doesn't feel like a "real" title-page. oh yeah, and the .epub version of the book? no side margins at all... it looks freakish... which means that if you don't like that look, you're gonna need a reader-program which "allows" you to adjust the margins yourself. just so you know... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sat Sep 19 13:36:57 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 19 Sep 2009 16:36:57 EDT Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer Message-ID: so, jim, i'm gonna give "wings of the dove" the whole z.m.l. treatment, so you can see it. i'm assuming that you used the copy from the university of california, at archive.org? > http://ia310110.us.archive.org/1/items/wingsofthedove01jamerich/ the other copy, from university of toronto, appears to have a 1909 publication date... by the way, do you have a copy of your file _before_ you rewrapped it, with the original linebreaks? 
because that would help me lots. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From richfield at telkomsa.net Sun Sep 20 04:19:44 2009 From: richfield at telkomsa.net (Jon Richfield) Date: Sun, 20 Sep 2009 13:19:44 +0200 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: References: <4AB37429.5090403@telkomsa.net> Message-ID: <4AB60FD0.3060009@telkomsa.net> Yes, I have a book by one Peter Rosza. It took me some time to realise that Peter was in fact a woman, and a well-known mathematician at that, whom we might have called Rose Peter. Tsk! These Magyars...! You'd think they would have come to us for advice. As for the Icelandic convention, I knew that there was something funny about all their terminal "-sons" and "-dotters" (sp?) but don't they have any family name at all? Some of the Slavic names might be troublesome too, because they vary the suffix of what I take to be the family name, according to gender: -ski vs -ska and so on. But maybe I have that mixed up as in the Icelandic names. Could it be that the Icelandic convention derives from the fact that they are dealing with a smallish population? Anyway, It seems to me that the indexing convention I proposed would still be easy to apply by anyone that understands the naming convention of the language and the population in question. Simply write the complete name (or whatever part suits the DB in question) in the lexically normal way according to the favoured convention, then rotate it till the first letter after the last non-alphabetic character is first in the string, and voila! Go well, Jon Andrew Sly wrote: > And don't forget that other national traditions you can have > more confusion. > > For example: > > For Hungarian names, the preferred order is [Family name] [Given name] > So that in the main text of PG#19433 the author's name is given as: > Balazs Bela, with the understanding that the first name that appears > is the one we alphbetize by. > > And in Icelandic names, what looks to us as a "last name" is > not actually a family name, but a patrynomic. It is incorrect > to alphabetize by that, so the given name is used instead. > > --Andrew > > > From publiek.devos at skynet.be Sun Sep 20 11:35:40 2009 From: publiek.devos at skynet.be (Frits Devos) Date: Sun, 20 Sep 2009 20:35:40 +0200 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: <4AB60FD0.3060009@telkomsa.net> References: <4AB37429.5090403@telkomsa.net> <4AB60FD0.3060009@telkomsa.net> Message-ID: <5C1815AD-1652-47D3-9C63-BA5F4F03B646@skynet.be> Whole can of worms. In the Netherlands Walter van Holst (If you allow me to use your name, Walter) would be sorted by "Holst" . In Belgium, using the same language, it would be sorted by "van Holst" (and the "van" would be capitalised). Frits Op 20-sep-09, om 13:19 heeft Jon Richfield het volgende geschreven: > Yes, I have a book by one Peter Rosza. It took me some time to > realise that Peter was in fact a woman, and a well-known > mathematician at that, whom we might have called Rose Peter. > Tsk! These Magyars...! You'd think they would have come to us for > advice. > > As for the Icelandic convention, I knew that there was something > funny about all their terminal "-sons" and "-dotters" (sp?) but > don't they have any family name at all? > Some of the Slavic names might be troublesome too, because they > vary the suffix of what I take to be the family name, according to > gender: -ski vs -ska and so on. 
But maybe I have that mixed up as > in the Icelandic names. > Could it be that the Icelandic convention derives from the fact > that they are dealing with a smallish population? > Anyway, It seems to me that the indexing convention I proposed > would still be easy to apply by anyone that understands the naming > convention of the language and the population in question. Simply > write the complete name (or whatever part suits the DB in > question) in the lexically normal way according to the favoured > convention, then rotate it till the first letter after the last non- > alphabetic character is first in the string, and voila! > > Go well, > > Jonm > > Andrew Sly wrote: >> And don't forget that other national traditions you can have >> more confusion. >> >> For example: >> >> For Hungarian names, the preferred order is [Family name] [Given >> name] >> So that in the main text of PG#19433 the author's name is given as: >> Balazs Bela, with the understanding that the first name that appears >> is the one we alphbetize by. >> >> And in Icelandic names, what looks to us as a "last name" is >> not actually a family name, but a patrynomic. It is incorrect >> to alphabetize by that, so the given name is used instead. >> >> --Andrew >> >> >> > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d From Bowerbird at aol.com Sun Sep 20 14:52:24 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 20 Sep 2009 17:52:24 EDT Subject: [gutvol-d] Re: Name lists and Big-endianism Message-ID: and we see yet another excellent example of how the "metadata" b.s. is such an unproductive path. the o.c.d. people love to focus on these minute details, which make very little difference at all -- who cares how "van holst" is sorted?, or if the "van" is capitalized or not?, or indeed whether it is "capitalised" or not?, because a search for "holst" is gonna find it no matter what you do -- and, as if this insignificance wasn't bad enough, such compulsiveness usually causes full paralysis. you can tie yourself up worrying about that crap... or you can cut the gordian knot and be productive. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sun Sep 20 15:50:19 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 20 Sep 2009 18:50:19 EDT Subject: [gutvol-d] the wings of the dove -- 001 Message-ID: hey jim, if you're still around (and you should be), then you'll want to know i've started on that book, "the wings of the dove", by henry james. i've mounted it on my website; you can find it here: > http://z-m-l.com/go/wotdj/wotdjp123.html that's a page-by-page rendering, with a form at the bottom of each page for _reporting_errors_... you will see that the text for each page is given, as well as the scan for that page, for verification. this is in keeping with your excellent suggestion that p.g. should mount a book in such a way that other people can come along later to improve it. both the text and the scans are from archive.org. (i used the scan from the university of california.) as many people here undoubtedly know, the o.c.r. done by the o.c.a. is dreadful. most people likely think it's dreadful in the way that much o.c.r. is, namely, that it's filled with misrecognition errors. but the o.c.r. from the o.c.a. is worse. much worse. that's because their tech people there mishandle it. 
specifically, they _lose_ the em-dashes in the text! oh, the o.c.r. recognizes the em-dashes, but then -- somewhere in their file-handling workflow -- the o.c.a. "tech people" there lose the em-dashes! for example, look at page 9 from the book: > http://z-m-l.com/go/wotdj/wotdjp009.html you'll see that the em-dashes in the last paragraph have been dropped from the text. it's unbelievable! this problem _alone_ is enough to make the o.c.r. totally unworkable. but this isn't the only problem. (i tried to restore the em-dashes programmatically, by coding a tool, but it's less work to redo the o.c.r.) that's not all. there are more problems... if you look at the text more closely, you will see that the techies also lost the apostrophes in contractions! i did a few global changes to restore _some_ of 'em, like in the contraction "i'm", but i didn't fix them all... this is not a problem that is _common_ to o.c.a. books, but it's not a _rare_ occurrence either. stunning idiocy. and further, the hyphens on end-line hyphenates are missing as well! this sometimes happens in the o.c.r., so i'm not sure if that's what happened with this book or if end-line hyphens were lost in the o.c.a. workflow, but whatever it was, damage to the text is considerable. and like i said, these problems are rather pervasive... it's really ridiculous -- and quite sad -- that the people in charge of the technology over at the o.c.r. are idiots who have built a workflow that actually damages text... what's even worse is that -- when i've brought this to their attention -- they've responded with ad hominem attacks on _me_, as if _i_ were the guilty perpetrator... eventually, when i persisted, they finally consented to solve the worst of the problems -- the em-dashes -- but i don't know if they ever did solve the problem... meanwhile, they've banned me from their listserves, so they wouldn't have to listen to my persistent posts. talk about killing the messenger! it's appalling... the main reason this is so troubling is that the o.c.a. are supposed to be "the good guys", who are the only competitor to google. and they are badly incompetent. plus they have thin skins to boot, and they would rather _silence_ the people who point out their problems than do the work that will solve their self-induced problems. this does not bode well for our future. not well at all... anyway... the main benefit of the o.c.r. from the o.c.a. is that it has retained the structure from the original book. that means we can use a clever mash-up tool (requiring lots of elbow-grease) to use the cleaned-up text from jim and hang it on the structural scaffolding of the o.c.a. text. first, let's look at the o.c.a. text: > http://z-m-l.com/go/wotdj/wotdj.zml this is the single-file version of the .zml text for this book, the file that was used to generate the page-by-page view... now, in a separate window side-by-side with the above, let's load in (a slightly reworked version of) jim's text: > http://z-m-l.com/go/wotdj/wotdj.txt you'll see that you're able to match of the paragraphs... for instance, do a search for "she looked about her and" to find that paragraph in both windows, to sync them... so essentially, what we want to do is take the linebreaks and pagebreaks from the o.c.a. file and inject them into the (clean) e-text. we want to reintroduce the structure. (which p.g. should have never stripped in the first place.) or, to look at it in the other direction, we want to replace all of the incorrect lines of text in the o.c.a. 
version with the good, cleaned equivalent text from the p.g. version... once we do that, we'll have a good clean structured e-text. we'll get to that this week. gotta get some vitamin d now... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkloos at dodo.com.au Sun Sep 20 16:07:26 2009 From: kkloos at dodo.com.au (Keith Kloosterman) Date: Mon, 21 Sep 2009 09:07:26 +1000 Subject: [gutvol-d] Mailing list Message-ID: <4AB6B5AE.5080006@dodo.com.au> Hi, Please remove me from this mailing list. Thank you. Keith Kloosterman -- I am using the free version of SPAMfighter. We are a community of 6 million users fighting spam. SPAMfighter has removed 10 of my spam emails to date. Get the free SPAMfighter here: http://www.spamfighter.com/len The Professional version does not have this message --- avast! Antivirus: Outbound message clean. Virus Database (VPS): 090920-0, 20/09/2009 Tested on: 21/09/2009 9:07:27 AM avast! - copyright (c) 1988-2009 ALWIL Software. http://www.avast.com From ajhaines at shaw.ca Sun Sep 20 16:19:09 2009 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sun, 20 Sep 2009 16:19:09 -0700 Subject: [gutvol-d] Re: Mailing list References: <4AB6B5AE.5080006@dodo.com.au> Message-ID: <55EE0A3DC0444E169B1B2F96DA899AB9@alp2400> Keith, you can remove yourself from this, or any other, PG list: - go to http://lists.pglaf.org/mailman/listinfo - click on the link to the list you want to remove yourself from, gutvol-d, in this case - at the bottom of the resulting page, you'll see an "Unsubscribe or edit options" button - enter your email address at the prompt to its left, and click the button. Al ----- Original Message ----- From: "Keith Kloosterman" To: Sent: Sunday, September 20, 2009 4:07 PM Subject: [gutvol-d] Mailing list > Hi, > > Please remove me from this mailing list. > > Thank you. > > Keith Kloosterman > > > -- > I am using the free version of SPAMfighter. > We are a community of 6 million users fighting spam. > SPAMfighter has removed 10 of my spam emails to date. > Get the free SPAMfighter here: http://www.spamfighter.com/len > > The Professional version does not have this message > > > --- > avast! Antivirus: Outbound message clean. > Virus Database (VPS): 090920-0, 20/09/2009 > Tested on: 21/09/2009 9:07:27 AM > avast! - copyright (c) 1988-2009 ALWIL Software. > http://www.avast.com > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From joyce.b.wilson at sbcglobal.net Sun Sep 20 20:39:35 2009 From: joyce.b.wilson at sbcglobal.net (Joyce Wilson) Date: Sun, 20 Sep 2009 22:39:35 -0500 Subject: [gutvol-d] Re: Name lists and Big-endianism Message-ID: <4AB6F577.2030105@sbcglobal.net> The change I would like is to have spaces taken into account in the name sort. So we would have something like this: Green, Alice Green, Robert Greenacre, Janet Greenjeans, Mr. instead of like this: Greenacre, Janet Green, Alice Greenjeans, Mr. Green, Robert --Joyce From schultzk at uni-trier.de Mon Sep 21 00:14:20 2009 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Mon, 21 Sep 2009 09:14:20 +0200 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: References: Message-ID: <30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de> Hi There, Am 20.09.2009 um 23:52 schrieb Bowerbird at aol.com: > and we see yet another excellent example of how > the "metadata" b.s. is such an unproductive path. Not true. 
It is how the metedata is use or structured. See Below. > > the o.c.d. people love to focus on these minute > details, which make very little difference at all > -- who cares how "van holst" is sorted?, or if the > "van" is capitalized or not?, or indeed whether > it is "capitalised" or not?, because a search for > "holst" is gonna find it no matter what you do -- > and, as if this insignificance wasn't bad enough, > such compulsiveness usually causes full paralysis. Here BB is right on the point. Basically, the metadata is a dataabase. so we have the field for the name and then one or several fields of indexing that field. Furthermore in a typical library cataloge you wil find "Walter van Holst" under "Walter van Holst", "van Hols, Walter" and "Holst, van, Walter". So where doe sit leave us? With the development of a structured databese. Which means that we will have to comprise, that is cover the basic cases and in certain cases hand edit the fields involved. These special cases will be harder to find, but there will be a set of rules which will help us look for them. To make things easier we could use cross- references as in library catalogues. There is no magic bullet. As aexample take look at iTunes. It has field for sorting Artist. they use a db and for my own CDs the information is gotten from a diferent DB. I have my own notion how things should be sorted. So I edit the "sort for Artist" field. The only problem here is that for classical music sorting/ indexing by Artist is not viable. I prefer to use the Komposer field. So I have to use a different index. So what should be done is say our index follow these rules for names. If you cannot find a name where you expect it to be search do a full text search of the field X and you should find what you are looking for if not use the full name field !!! regards Keith. -------------- next part -------------- An HTML attachment was scrubbed... URL: From walter.van.holst at xs4all.nl Mon Sep 21 00:30:33 2009 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Mon, 21 Sep 2009 09:30:33 +0200 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: <30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de> References: <30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de> Message-ID: <2aef272c5d38121baf7069b06bc76554@xs4all.nl> On Mon, 21 Sep 2009 09:14:20 +0200, "Keith J. Schultz" wrote: >> the o.c.d. people love to focus on these minute >> details, which make very little difference at all >> -- who cares how "van holst" is sorted?, or if the >> "van" is capitalized or not?, or indeed whether >> it is "capitalised" or not?, because a search for >> "holst" is gonna find it no matter what you do -- >> and, as if this insignificance wasn't bad enough, >> such compulsiveness usually causes full paralysis. > Here BB is right on the point. Not quite. If I am looking for a book written by a particular author, I want to be able to search for his or her name and not for all books about that particular author. Therefore metadata has a, albeit in this era of sophisticated search algorithms, somewhat reduced, purpose. And to that particular bird that is usually relegated to my spambox: I really do care whether the 'van' part in my family names is capitalised or not. I'm rather proud of it and do not need beastly pseudonyms to cower behind. Regards, Walter From schultzk at uni-trier.de Mon Sep 21 00:51:00 2009 From: schultzk at uni-trier.de (Keith J. 
Schultz) Date: Mon, 21 Sep 2009 09:51:00 +0200 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: <4AB6F577.2030105@sbcglobal.net> References: <4AB6F577.2030105@sbcglobal.net> Message-ID: Hi There, Am 21.09.2009 um 05:39 schrieb Joyce Wilson: > The change I would like is to have spaces taken into account in the > name sort. > > So we would have something like this: > > Green, Alice > Green, Robert > Greenacre, Janet > Greenjeans, Mr. > > instead of like this: > > Greenacre, Janet > Green, Alice > Greenjeans, Mr. > Green, Robert Duhhhh !! If this is true there are some people that ougth to take a course in 101 programming or db design. It takes about 5 minutes to write the code. IsEntrySmallertThan(X, Y) :- Pos := 0; If (Length(X) < Length (Y)) then MaxPos = Length(X) -1; else MaxPos = Length(Y) - 1; end if While ((IsSmaller := CharSmaller(X[Pos], Y[Pos]) == 0) and Pos != MaxPos ) Pos := Pos +1; end While return IsSmaller; end EntrySmallerThan CharAtSmaller(X, Y) :- If (Cardinal(X) < Cardinal(Y) ) return 1 else If Cardinal(X) > Cardinal(Y) then return -1; else return 0; end if end if end CahrAtSmaller Cardinal(X) :- If (X in set of standard Chars) then return X else return -1; end if end Cardinal Put this ipseudo code what language you want and voila. Cardinal can be made as complex as you want if you needed finer distinctions. regards Keith. From marcello at perathoner.de Mon Sep 21 09:06:27 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 21 Sep 2009 18:06:27 +0200 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: <4AB6F577.2030105@sbcglobal.net> References: <4AB6F577.2030105@sbcglobal.net> Message-ID: <4AB7A483.5000606@perathoner.de> Joyce Wilson wrote: > The change I would like is to have spaces taken into account in the name > sort. > > So we would have something like this: > > Green, Alice > Green, Robert > Greenacre, Janet > Greenjeans, Mr. > > instead of like this: > > Greenacre, Janet > Green, Alice > Greenjeans, Mr. > Green, Robert We can't do that because our database server at ibiblio uses POSIX collation. We cannot ask ibiblio to change that because the server is shared between multiple sites hosted at ibiblio and POSIX is the most general collation. Maybe the next software upgrade will allow us to set collation per database. Re-sorting database output on the web server is impracticable because it would add considerable overhead to the database and web server load. But the most important argument against changing anything is that we dont want to impose the preference of any one user over the rest of the world. There are just too many collation strategies: Classic Spanish treasts 'ch' and 'll' as single letters. Norwegian sorts 'aa' to the top or bottom according to pronunciation. German phonebooks sort '?' as 'oe', but Austrian phonebooks sort '?' after 'o'. Dutch phonebooks sort 'ij' as 'y', but Belgian phonebooks do not. Now, which one should we prefer? -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Mon Sep 21 09:58:36 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Sep 2009 12:58:36 EDT Subject: [gutvol-d] Re: Name lists and Big-endianism Message-ID: walter said: > If I am looking for a book written by > a particular author, I want to be able to > search for his or her name and not for > all books about that particular author. i agree. but that's not the point at issue. the point here is that an unhealthy focus on metadata usually makes one catatonic. 
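The collation differences Marcello lists above are easy to see directly. The following is only an illustrative sketch, in Python, with made-up sample names; it assumes a system where the de_DE.UTF-8 locale is installed, and the exact ordering will vary from platform to platform:

    import locale

    names = ["Olsen", "Zulrich", "Öhlen"]

    # POSIX/C collation compares code points, so "Öhlen" lands after "Zulrich".
    locale.setlocale(locale.LC_COLLATE, "C")
    print(sorted(names, key=locale.strxfrm))

    # A German locale (if installed) files "Ö" together with "O", so "Öhlen"
    # moves up next to "Olsen"; an Austrian phone-book collation would differ again.
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(names, key=locale.strxfrm))

Which of these the catalog should use is exactly the question Marcello leaves open.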
> I really do care whether the 'van' part > in my family names is capitalised or not i'm sure you do, walter, and bully for you, which is why it would be a shame if some o.c.d. cataloger made it uppercase for the simple purpose of fitting their sort method. sorting was very important in the old days where we had a _physical_ card-catalog, but it's silly to get bogged down in it today. the o.c.d. people, however, just love to get bogged down with any subject that they can. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Sep 21 10:08:43 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Sep 2009 13:08:43 EDT Subject: [gutvol-d] Re: Name lists and Big-endianism Message-ID: walter said: > And to that particular bird that is usually relegated to my spambox oh, and by the way, there's an important object lesson here... it's fine to put someone in your spam folder -- i do it myself -- but you should then remember that you are _not_ hearing the whole conversation, and therefore probably should refrain from commenting on any bits and pieces of text from the person who you are ignoring, because you're likely to miss something vital, and end up making yourself look silly. and what makes this doubly ironic is that -- when that other person who you are ignoring corrects you -- you won't hear that correction. but everybody else will, so you won't know that the other person made you look silly, but everyone else will know, and then -- down the line -- when everyone else knows that the person had made you look silly, you will make yourself look even sillier when you say you are ignoring them. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Mon Sep 21 10:13:03 2009 From: jimad at msn.com (James Adcock) Date: Mon, 21 Sep 2009 10:13:03 -0700 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: References: Message-ID: >and we see yet another excellent example of how >the "metadata" b.s. is such an unproductive path. >the o.c.d. people love to focus on these minute >details, which make very little difference at all >-- who cares how "van holst" is sorted?, . You make great big assumptions about the nature of the machines that people are reading on, and then make incorrect conclusions based on those assumptions. Yes, if all readers are reading on desktop computers running some flavor of *nix then your conclusions may be correct. But, not all readers of PG books are running *nix, or even desktops. Many of these machines have a very different notion of "sorting" than you have in mind. Which is why we just had this conversation a couple days ago, but, I guess many people didn't get it. On my favorite class of machine, which something like a million+ other readers are reading on, and more every day, "sorts" are typically done on authorlastname, where authorlastname is something provided within the book file. That part which does not correspond to authorlastname is stored by convention in authorfirstname. This sort information is displayed to the reader in one of two ways, both of which ought to appear sensible: Authorlastname, authorfirstname And Authorfirstname authorlastname In either case the actual sort should be on authorlastname This class of machine has no notion of the idea that you can type in part of an authors' name and search on that. 
Rather all the books on the machine are sorted and displayed in order by authorlastname, and you find a book by scrolling for the authorlastname in sort order within that list. Why does this matter? Consider the famous author name Sun Tzu What is the last name? Sun What is the first name? Well, no one actually knows, but historically "Tzu" which is actually an honorarium is stuck in the authorfirstname slot. But now look what happens: In the authorlastname, authorfirstname case you get: Sun, Tzu Which is not a bad result In the Authorfirstname authorlastname case you get: Tzu Sun Which is an error. Thus, perhaps, one concludes with names where family name needs to display first the encoding has to be: Authorlastname: Sun Tzu Authorfirstname: null In which case both displays work out right. How does one write an automatic algorithm to figure these things out from an existing gut authorlist? Answer, again, is that one can not write an automatic algorithm to figure these things out because currently there isn't enough information stored about author names, and further, how author names are sorted and displayed are based in part on library tradition, perhaps best found by researching Library of Congress for a particular author. Another way of saying this is, let's say you make the mistake of wandering into a Barnes and Noble when you were actually trying to enter the Starbucks next door. But while in there you decide to look at the fiction stacks just for fun to see if they have your favorite author. Where in the stacks do you look? Well, that depends on how B&N sorts on your favorite author, which in turn is based on library tradition for that particular author. Yes you can try to write an algorithm to do this but then you will find that surprisingly often it breaks, because it seems that having an unusual family name is a prereq for writing a book. You can then say "oh well this is PG we really don't care why be o.c.d.?" But then you are producing books that work inferior, in practice, for customers, on customer's machines, compared to the other publishing houses, making PG look like amateur hour. You might say "well then they shouldn't have bought that machine rather they should buy my favorite choice of machine." But customers tend to consider that attitude towards their choice of machine a sign of hostility towards the customer by PG - which I guess is why PG already provides literally about 80 different file formats for customers. I believe PG needs to remain agnostic towards the customers' choice of machine if PG wants to retain the customer, which means that PG needs to understand how the differing classes of machines actually work, and what their constraints are. Getting authors, titles, and sort orders "correct" IS pretty basic. Not easy, but basic. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Sep 21 10:26:14 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Sep 2009 13:26:14 EDT Subject: [gutvol-d] the wings of the dove -- 002 Message-ID: as i said, one of the first steps in doing our little mash-up is to sync up the paragraphs. in doing so, i found a half-dozen mistakes that jim had made in the paragraphing... for others who want to verify these errors, i suggest you look directly at jim's #29452: > http://www.gutenberg.org/files/29452/29452-8.txt in addition, i'll give you the u.r.l. to see the actual scans for each page up on my site... 
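On the author-name point in the message above: one concrete way to ship the "spine" representation alongside the display name is the sort-key attribute in EPUB package metadata, so a reading system can sort on one string and display the other. The sketch below assumes the EPUB 2 OPF conventions (a dc:creator element with the opf:role and opf:file-as attributes, with the opf: namespace declared on the package element); the helper function and sample values are only for illustration:

    from xml.sax.saxutils import escape

    def creator_element(display_name, file_as):
        # An EPUB 2 OPF <dc:creator> element carrying the display form plus a
        # separate sort key (opf:file-as) for the reading system to order by.
        return ('<dc:creator opf:role="aut" opf:file-as="%s">%s</dc:creator>'
                % (escape(file_as, {'"': '&quot;'}), escape(display_name)))

    print(creator_element("Sun Tzu", "Sun Tzu"))
    print(creator_element("Marquis de Sade", "Sade, Marquis de"))

Giving "Sun Tzu" with no comma as the file-as value has the same effect as the "Authorlastname: Sun Tzu, Authorfirstname: null" encoding suggested above: both display styles come out acceptably and the entry still sorts under S.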
*** in 3 cases, jim missed a paragraph break: > Her response, when it came, was cold but > http://z-m-l.com/go/wotdjp026.html > She put it as to his caring to know > http://z-m-l.com/go/wotdjp263.html > There was a finer > http://z-m-l.com/go/wotdjp317.html *** in another 3 cases, jim incorrectly broke an existing paragraph into two paragraphs. > This was, fortunately for her > http://z-m-l.com/go/wotdjp123.html > What queerer consequence > http://z-m-l.com/go/wotdjp181.html > It just faintly rankled in her > http://z-m-l.com/go/wotdjp201.html *** 6 paragraphing mistakes is not bad performance. with a book containing some 330 pages, like this, i would say that it's probably about an average job. *** besides, my point is never to say "gotcha! errors!" as super-proofer jose menendez has proven, i make my fair share of book-digitizing errors. so that's not the point. there are several big issues that _are_ the point: 1. comparing digitizations is a great way to pinpoint errors so that they can be corrected. i have made this point in repeated examples. 2. most of the books in the library have errors. even the best ones, which were done recently... if you're convinced there are no errors there, you just don't know how to find them, and i strongly suggest you return to the first point. 3. most of the p.g. e-texts will be used only to proof scan-sets that retain the book's structure, and then the p.g. e-text will simply be discarded, since it doesn't contain that important structure. 4. the p.g. plain-text format has a lot of power and beauty inside it, if it's merely extended a bit, which is precisely what i did when i created z.m.l. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Mon Sep 21 10:30:20 2009 From: jimad at msn.com (James Adcock) Date: Mon, 21 Sep 2009 10:30:20 -0700 Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT In-Reply-To: References: Message-ID: >by using a program on your own personal computer. You assume the reader has a personal computer. Some do not. More importantly, many do not at that point in time when they decide they want to choose a new book to read, such as while sitting at an airport waiting for the plane to take off. -------------- next part -------------- An HTML attachment was scrubbed... URL: From richfield at telkomsa.net Mon Sep 21 08:39:40 2009 From: richfield at telkomsa.net (Jon Richfield) Date: Mon, 21 Sep 2009 17:39:40 +0200 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: <30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de> References: <30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de> Message-ID: <4AB79E3C.7000306@telkomsa.net> Hi Keith > > > With the development of a structured databese. Which means > that we will have to comprise, that is cover the basic cases and > in certain cases hand edit the fields involved. These special cases > will be harder to find, but there will be a set of rules which will > help us look for them. To make things easier we could use cross- > references as in library catalogues. > > There is no magic bullet. As aexample take look at iTunes. > It has field for sorting Artist. they use a db and for my own > CDs the information is gotten from a diferent DB. I have my own > notion how things should be sorted. So I edit the "sort for Artist" field. > The only problem here is that for classical music sorting/ indexing by > Artist is not viable. I prefer to use the Komposer field. So I have to > use a different index. 
I take your point, but I reckon that with a bit of definition of canonical fields and formats one should be able to clean the lot up with the exception of cases where previous manual record entry had violated sensible rules. Most of the problems could be cleaned up automatically, and only the horrible examples (basically errors) need get special manual treatment. Trying to construct special rules for your data base to negotiate, would fall foul of the ingenuity of fools. Whether you really need a "formal data base" or not is an open question. Some direct access to properly sorted and indexed files can be startlingly effective. Jon From richfield at telkomsa.net Mon Sep 21 08:29:48 2009 From: richfield at telkomsa.net (Jon Richfield) Date: Mon, 21 Sep 2009 17:29:48 +0200 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: References: Message-ID: <4AB79BEC.4010606@telkomsa.net> Really BB! You of all people! You can do better than that! I had assumed that you were IT-savvy. What you say suggests that you may be a DB user, but you sure as Sherridan don't talk like a DB designer, much less a systems designer. Capitalised or not? Weeeellll... maybe if the distinction is built into your software and hard to leave out. Have you considered what difference it makes to the mechanics of sorting, classification, or access? Whether you see it happen or not? For little toy kilo-record files it might be trivial, but we don't all work on those all the time. How anything is sorted? Oh boy... BB, in a certain large corporation which here shall be nameless, I got lumbered with a job of indexing the world-wide email and phone list after some other people repeatedly failed to do it. (Their software tools kept dying when fed the full files.) I wrote the application from scratch with no pain in an unfamiliar language in a few days, partly because I saw to it that a temp got hired to re-format all the names canonically. A year or two later Global HQ decreed a new, commercial-DB-based (Again no names of which large corporation's DB package it was based on!) package, and so we used that instead. Except that the savvy seniors clandestinely loaded and retained my version for years afterward because it was easier to use, more often successful in searching, and faster than the off-the-shelf even when there was a first-time hit. Canonically formatted files are VERY efficiently handleable. But you knew that BB, didn't you? How about this? A certain file-checking job involved cross-checking two files against each other. (Again, never mind which international corporation's files those were!) The job had been manual, but rapidly became infeasible as the files grew. Someone wrote a quick-and-dirty to help, but it took a week to run (5-day week, but still!) and only partly did the job. Someone (maybe the same guy; I don't remember) did the job better, and it ran in a day, still partly successfully. Someone else did a totally different job and it ran in a couple of hours, almost successfully, but it didn't work. Then to get me out of someone's hair I got the job. I began by reformatting the input file every run. Stupid, but whoever expected anything else. Run time, including the sort (Which I also had to write myself) and selection match pass: 49 seconds. Several orders of magnitude improvement in performance plus perfect results. And best of all, it didn't take a lot of sexy programming, just competent design. 
I probably cold have halved the times for both jobs if I had written in low level code, but it wasn't really necessary. Now BB, I reckon that when proper attention changes a job from not worth running, to so trivial that at first the user thinks that the job hadn't run, it is not a "minute detail, which makes very little difference at all", but a very important detail, which makes enough difference to get management respect -- till the next toughie comes along! You see BB, 'who cares how "van holst" is sorted? --a search for "holst" is gonna find it no matter what you do' is exactly the sort of detail that made the difference in the real life cases. Would you believe, BB, that I could go on for some time in this vain vein? My Gordian (Note the Caps BB!) gnot was nicely productive once I kut it with proper knit-picking design (as in untangling rather than depediculotic activity). Not a louse-egg of "full paralysis" in sight, or in anyone's hair! It is not a matter of bottom-up vs top-down; it is knowing when and why which is appropriate. Cheers, Jon > and we see yet another excellent example of how > the "metadata" b.s. is such an unproductive path. > > the o.c.d. people love to focus on these minute > details, which make very little difference at all > -- who cares how "van holst" is sorted?, or if the > "van" is capitalized or not?, or indeed whether > it is "capitalised" or not?, because a search for > "holst" is gonna find it no matter what you do -- > and, as if this insignificance wasn't bad enough, > such compulsiveness usually causes full paralysis. > > you can tie yourself up worrying about that crap... > or you can cut the gordian knot and be productive. > > -bowerbird > > ------------------------------------------------------------------------ > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d From richfield at telkomsa.net Mon Sep 21 08:45:24 2009 From: richfield at telkomsa.net (Jon Richfield) Date: Mon, 21 Sep 2009 17:45:24 +0200 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: <2aef272c5d38121baf7069b06bc76554@xs4all.nl> References: <30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de> <2aef272c5d38121baf7069b06bc76554@xs4all.nl> Message-ID: <4AB79F94.3060507@telkomsa.net> Hi Walter Generally I agree, though I don't think that most of the extant search algorithms are so sophisticated. Most packages use brute force, relying on fast hardware. "Throwing silicon at the problem." It works to a point, but in data sets that grow far enough to run into exponential problems (even large quadratic problems ftm) a decent design relying on an appropriate algorithm can do nice things for nice people. CU Jon > > Not quite. If I am looking for a book written by a particular author, I > want to be able to search for his or her name and not for all books about > that particular author. Therefore metadata has a, albeit in this era of > sophisticated search algorithms, somewhat reduced, purpose. > > And to that particular bird that is usually relegated to my spambox: I > really do care whether the 'van' part in my family names is capitalised or > not. I'm rather proud of it and do not need beastly pseudonyms to cower > behind. 
> > Regards, > > Walter > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > > > From marcello at perathoner.de Mon Sep 21 10:40:51 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 21 Sep 2009 19:40:51 +0200 Subject: [gutvol-d] Once every 7 years a post of monumental stupidity comes along ... In-Reply-To: References: <4AB6F577.2030105@sbcglobal.net> Message-ID: <4AB7BAA3.5080706@perathoner.de> ... which isn't even funny. Here it is: Keith J. Schultz wrote: >> The change I would like is to have spaces taken into account in the >> name sort. >> >> So we would have something like this: >> >> Green, Alice >> Green, Robert >> Greenacre, Janet >> Greenjeans, Mr. >> >> instead of like this: >> >> Greenacre, Janet >> Green, Alice >> Greenjeans, Mr. >> Green, Robert > Duhhhh !! If this is true there are some people > that ougth to take a course in 101 programming or db > design. It takes about 5 minutes to write the code. And it took the writer of that post no longer than that to ruin his reputation forever. Bowerbird, meet Keith, Keith, meet Bowerbird. Obviously the writer's ignorance about modern web serving infrastructure is complete. Even a single afternoon class about database programming would have taught him enough to keep his mouth shut. The writer of this nonsense obviously does not know that: - To sort a dataset locally on a web server, like the writer proposes, you have to request the whole dataset from the database server. This induces a considerable load on the database server and on the wire. - Sorting on the web server is much slower than sorting on the database server because the database server uses precomputed tables (indexes) which are already sorted, but the web server needs to sort from scratch. So instead of asking the database server to: give me 100 authors sorted by name starting at offset 4500 which the server could almost instantly satisfy out of the pre-sorted index tables, you have to ask the server to give me all authors which are 12800 at present. Instead of reading 100 rows from the disk and passing them over the wire to the web server, you'll end up reading from the disk 12800 rows and transmitting them. Already a factor of 128 times slower. Then comes the gratuitous sorting of 12800 rows on the web server. After which sort we throw away 12700 rows and present the user with the 100 rows she requested. But the ignorance of the writer is not only colossal regarding present day database systems, it becomes even more surrealistic when the writer tries to apply himself to programming. The writer wastes 30 lines of code to re-implement a function that every programming language carries out-of-the-box. That alone would have sufficed to demonstrate that the writer's notions about programming are extremely vague at best. We will furthermore see that the writer used pseudo-code not only to hide his ignorance of any actual programming language, but also to avoid having to test his absurd concoction, which test would have immediately revealed its uttermost bullshittiness even to himself. 
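(The contrast drawn above can be made concrete. A minimal sketch using
Python's sqlite3 module -- the database file, table and column names are
invented for the example, and an index on the name column is assumed to
exist.)

    import sqlite3

    conn = sqlite3.connect("catalog.db")   # hypothetical catalog

    # Server-side paging: the database walks its pre-built index on
    # "name" and hands back only the page that was asked for.
    fast_page = conn.execute(
        "SELECT name FROM authors ORDER BY name LIMIT 100 OFFSET 4500"
    ).fetchall()

    # The approach being criticised: pull every author over the wire,
    # sort in the application, keep 100 rows, throw the rest away.
    all_rows = conn.execute("SELECT name FROM authors").fetchall()
    slow_page = sorted(all_rows)[4500:4600]

(Collation details aside, both end up with the same hundred names; the
difference is where the sorting happens and how many of the 12,800 rows
ever leave the database server.)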
The absurd proposal of the writer runs thus (feel free to skip to the beef, the irksomeness of this code is just good enough for a smile): > IsEntrySmallertThan(X, Y) :- > Pos := 0; > If (Length(X) < Length (Y)) > then MaxPos = Length(X) -1; > > else MaxPos = Length(Y) - 1; > end if > > While ((IsSmaller := CharSmaller(X[Pos], Y[Pos]) == 0) > and Pos != MaxPos ) > Pos := Pos +1; > end While > > return IsSmaller; > end EntrySmallerThan > > CharAtSmaller(X, Y) :- > If (Cardinal(X) < Cardinal(Y) ) > return 1 > else > If Cardinal(X) > Cardinal(Y) > then return -1; > else return 0; > end if > end if > end CahrAtSmaller > > Cardinal(X) :- > If (X in set of standard Chars) > then return X > else return -1; > end if > end Cardinal > > Put this ipseudo code what language you want and voila. > Cardinal can be made as > complex as you want if you needed finer distinctions. > > regards > Keith. For the sake of playing let us call: IsEntrySmallertThan ('a', 'ab'). MaxPos would then be set to 0. The While loop will call CharSmaller, which does not exist, because the function is called CharAtSmaller. First Bug. CharAtSmaller would then return 0 because it compares 'a' to 'a', which two are equal. The iteration will then stop because Pos == MaxPos == 0. The function will then return. Conclusion: IsEntrySmallertThan ('a', 'ab') returns 0 According to this guy's wisdom, 'a' is not `smallert? than 'ab'. QED Moreover: this code would dump core on you the moment you call it with an empty string, Cardinal (X) returns X or -1 so you'll end up comparing characters with -1, which will not work on machines with unsigned characters ... and so on. Throwing even one line of code over the wall without testing it, is the hallmark of the utter clueless beginner. Even people less full of themselves fall for it sometimes. -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Mon Sep 21 10:43:03 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Sep 2009 13:43:03 EDT Subject: [gutvol-d] Re: Name lists and Big-endianism Message-ID: look, jim, you raised some important issues.... which i am willing to talk about. and you raised some unimportant "issues", which i am not all that eager to talk about. so i should just tell you "no" right now, but i'm gonna say it one more time, just for you. > You make great big assumptions about the > nature of the machines that people are reading on i assume you can search the "metadata", yes. (and if you cannot, you need to take that up with someone else, because that is a basic.) so if you want to find "sun tzu", you'd search for "sun tzu", and if that didn't work, then you'd search for "sun" and "tzu" separately... so it wouldn't matter where in the sort order that this record fell, because you could find it. same with marquis de sade and walter van holst and any other name you want to come up with... if you want to read more on this general idea, i would suggest "everything is miscellaneous". > But while in there you decide to > look at the fiction stacks just for fun > to see if they have your favorite author. you're still carrying around a physical mindset -- one which has always been riddled with problems -- when the world has moved to an electronic one. which is why i won't bother to discuss this any more. but those missing italics of yours? i'll discuss those. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Bowerbird at aol.com Mon Sep 21 10:49:07 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Sep 2009 13:49:07 EDT Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT Message-ID: jim said: > You assume the reader has a personal computer.? yes i do. and i consider the iphone to be a computer. (it has a chip inside it, you know.) even the kindle has a computer chip inside it. it's just too bad that you can't program for it. of course, even if amazon keeps castrating the kindle, it's entirely possible they would put a reader-program on the thing which was capable of rending z.m.l. beautifully... > Some do not. if you don't have a computer, then i simply can't tell you how you would use an e-book. > More importantly, many > do not at that point in time when they decide > they want to choose a new book to read, > such as while sitting at an airport > waiting for the plane to take off. paper books remain delightful, in my eyes... *** jim, you seem to want to argue about all these unimportant things... what about those italics? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Sep 21 10:56:32 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Sep 2009 13:56:32 EDT Subject: [gutvol-d] Re: Name lists and Big-endianism Message-ID: jon said: > Canonically formatted files are VERY efficiently handleable. i know that. > But you knew that BB, didn't you? yes i did. what i do _not_ know is this: who is going to hire the temp whose job it will be to format the p.g. metadata canonically? will it be _you_, jon? ;+) -bowerbird p.s. again, the book is called "everything is miscellaneous"... -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Mon Sep 21 11:47:06 2009 From: jimad at msn.com (James Adcock) Date: Mon, 21 Sep 2009 11:47:06 -0700 Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer In-Reply-To: References: Message-ID: I assure you Bowerbird, that contrary to your comments I did not "deliberately" disfigure the text file, and I would appreciate it if you retract your comments. In any case the "formatters" you refer to would be myself. An army of one. Also, I do not ever rewrap books of my own volition but only as required in order to be accepted for submission by PG. What you see posted by PG is not necessarily the same thing as I would choose to submit to PG, [nor identically that which I did in fact submit to PG] which in my case would probably at this point in time be an HTML, although I can imagine at some point in time with good tools TEI might be more interesting to me. If you are unhappy with HTML as an input submission format then I recommend writing a simple parser for HTML that changes the HTML choice of tags to the tags you prefer. If you wrote such a parser I suspect you could contribute it to PG where it would represent a positive contribution to the many volunteers like myself who would prefer to be submitting in HTML format in the first place. In practice HTML encodes most of what I as a volunteer would choose to spend my time and energy transcribing, but I wish it had a little more power, such as the ability to unambiguously encode authorfirstname, authorlastname, chapter divisions, etc. What I do do for PG represents considerable sacrifice to myself and my family, as I am sure my wife and children would be only too happy to attest. 
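(The "simple parser for HTML that changes the HTML choice of tags" suggested
a few paragraphs up is indeed a small job. Here is a rough sketch with
Python's html.parser; the i-to-em and b-to-strong mapping is only an example
of a "preferred tags" table, and comments, doctypes and self-closing tags
are left out.)

    from html.parser import HTMLParser

    TAG_MAP = {"i": "em", "b": "strong"}   # example preference table

    class TagRemapper(HTMLParser):
        def __init__(self):
            super().__init__(convert_charrefs=False)
            self.out = []

        def handle_starttag(self, tag, attrs):
            attr_text = "".join(' %s="%s"' % (k, v) for k, v in attrs
                                if v is not None)
            self.out.append("<%s%s>" % (TAG_MAP.get(tag, tag), attr_text))

        def handle_endtag(self, tag):
            self.out.append("</%s>" % TAG_MAP.get(tag, tag))

        def handle_data(self, data):
            self.out.append(data)

        def handle_entityref(self, name):    # pass &amp; etc. through untouched
            self.out.append("&%s;" % name)

        def handle_charref(self, name):      # pass numeric refs through untouched
            self.out.append("&#%s;" % name)

    r = TagRemapper()
    r.feed("<p>He read <i>The Wings of the Dove</i> twice.</p>")
    print("".join(r.out))   # <p>He read <em>The Wings of the Dove</em> twice.</p>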
If you think you have something positive to contribute to PG, please do so. Abusing me for my choice of which sacrifices I am willing to make, or not willing to make, does not represent a contribution to PG, nor does it encourage my continuing contributions to PG. The EPUB was not generated by me nor do I have any great knowledge of the EPUB format. I assume that some other volunteer at PG has written a tool to automatically generate EPUB from HTML and that volunteer did so with some choice of margins you do not prefer, or which doesn't work well with your choice of machine. I don't know how to fix this problem, but it does point out the advantages of TEI which allows the encoding in one document the various "hints" necessary for attractive rendering of the one TEI input file into various output rendering language targets. I also did not generate the MOBI, but I use MOBI files all the time with my favorite reader machine. The MOBI that some volunteer at PG, not me, has generated, looks beautiful on my choice of machine, which also allows me to change the size of the font and the margins to my liking, which tends to depend on the time of day - by midnight my eyes get tired and then I tend to like a larger font and smaller margins. Which is why I like reflow formats and reader machines - they allow me to easily "fix" many of the day-to-day "poor choices" that some one else has made which would otherwise get in the way of MY being able to enjoy the book the way *I* want. Presumably this other volunteer DID generate the MOBI file in a way that looked attractive to him or her on his or her choice of machines, which needn't be identical to my preferences - especially since my preferences tend to change with the time of day! My machine also works well with PDF files except I can't fix issues like when the person or process generating the PDF uses a "poor" choice of font, or poor choice of margins when read on my machine. I can sometimes work around these problems by holding my machine in landscape mode, and displaying only half a page of PDF at a time, but it tends to be awkward and painful to hold the machine sideways for a length of time, and PDF often doesn't like to be read a half a page at a time - since it is a page layout language, not a half page layout language. Which is why I tend to prefer reflow formats like MOBI or HTML over PDF. However, at the very least the acidity of Bowerbirds remarks reaffirms my contention that PG needs to allow volunteers like myself to submit files in the volunteer's choice of file formats, NOT Bowerbirds. In which case I could have offered PG my efforts in one file format, and PG could have chosen to accept or reject that offering. If PG chose to accept that offering then hopefully neither Bowerbird nor any other volunteer would abuse me of my efforts which PG has then already acknowledged. Rather, that volunteer would (hopefully) acknowledge that PG had already accepted my contribution, and in turn if they felt they could make further positive contributions to this book, or any other book, in that file format or in any other file format, then they would be free to do so. Unfortunately, there is not a universal sense within the PG community as to what does or does not represent a positive contribution, which in turn leads to that unhappy state of affairs to which Bowerbird is only too aptly demonstrating today. 
Again I ask consideration that PG seriously consider allowing volunteers to be able to submit books using only ONE file format if they choose to do so, not requiring multiple file formats since that leads to that unhappy state of affairs that Bowerbird is today only too well demonstrating. Better yet, pick YOUR OWN book to transcribe and contribute to PG, rather than abusing ME of MY efforts on MY choice of books! -------------- next part -------------- An HTML attachment was scrubbed... URL: From lee at novomail.net Mon Sep 21 12:13:06 2009 From: lee at novomail.net (Lee Passey) Date: Mon, 21 Sep 2009 13:13:06 -0600 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: References: Message-ID: <4AB7D042.5010407@novomail.net> James Adcock wrote: [snip] > What I don't understand is why PG continues to be wedded to plain-text as an > *input* encoding format demanded of people submitting texts to PG. > Plain-text is too constrained to do the job well. I find that you are generally correct in everything you have said to date. But the reality is that PG does continue to be wedded to plain (impoverished) text. This topic has come up regularly over the years, and in every case has ended without any improvement to PG. While I hesitate to say that your advocacy is futile, your advocacy is futile. > HTML is too ambiguous, > and too ill-matched to books to do well. We need something else, something > that CAN be correctly and automagically converted "correctly" to one or > another formats including plain-text, and Unicode, and HTML, and mobi, etc. HTML, true standards.) I have concluded that Project Gutenberg is impervious to improvement. While Bowerbird rejects the notion, I am not afraid to say that for what you are attempting to do Project Gutenberg may not be the correct archive. I would suggest, rather, perfecting your HTML file, uploading it to the Internet Archive (http://www.archive.org/create/) and then posting a message here indicating where it can be found if any other volunteer wants to create a degraded version of your master copy. From prosfilaes at gmail.com Mon Sep 21 12:27:42 2009 From: prosfilaes at gmail.com (David Starner) Date: Mon, 21 Sep 2009 15:27:42 -0400 Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer In-Reply-To: References: Message-ID: <6d99d1fd0909211227q765dd187n77dc76cbf85d9dd6@mail.gmail.com> On Mon, Sep 21, 2009 at 2:47 PM, James Adcock wrote: > If you think you have something positive to contribute to PG, please do so. > Abusing me for my choice of which sacrifices I am willing to make, or not > willing to make, does not represent a contribution to PG, nor does it > encourage my continuing contributions to PG. Which is why I've killfiled Bowerbird, and I believe that PG should permanently eject him for their mailing lists. > However, at the very least the acidity of Bowerbirds remarks reaffirms my > contention that PG needs to allow volunteers like myself to submit files in > the volunteer?s choice of file formats, NOT Bowerbirds. That's not a rational argument. Whatever the base file formats are, Project Gutenberg, like most archives, needs to pick one or a small set of them so that the people who use Project Gutenberg can know what they need to read the files. A PG text reader can't be demanded to understand any file that anyone cares to use, and nobody can be expected to understand Word 95 files, and similar garbage that infects indiscriminate archives. 
?-- Kie ekzistas vivo, ekzistas espero. From Bowerbird at aol.com Mon Sep 21 14:03:42 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Sep 2009 17:03:42 EDT Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer Message-ID: jim said: > I assure you Bowerbird, that contrary to your comments > I did not ?deliberately? disfigure the text file, and > I would appreciate it if you retract your comments. how come the .txt file is missing the italics which are right there, big as day, in the .html version of the file? did that happen accidentally? did someone else cause them to disappear? if i would have said that you did it "on purpose", would that have made it better? i did say that it was a rather "dramatic" subject-line, but still, since you're the person who submitted the work, i'm not sure how else we can explain the fact that the .html file has italics, but the .txt file does not. and you knew full well that the .txt version was missing the italics information, because that was the impetus that led you here to complain that the .txt format was inferior. it was, in your case, but _only_ because you deliberately made it so... (or, if you prefer, tell me what word you want me to use instead of "deliberately", if that's inaccurate.) > In any case the ?formatters? you refer to would be myself. > An army of one. so you threw away your own work. i guess that that doesn't carry the same moral baggage that throwing away the work of other people might... still, the fact remains that some of the formatting which you included in the .html version of the book was _not_ included in the .txt version, so people who use the .txt version have been deprived of some utility, which does indeed carry some moral baggage, sadly... so i don't think you can excuse the fact that you've thrown away utility, just because you did the work of providing that utility in the .html version of the book. > Also, I do not ever rewrap books of my own volition but > only as required in order to be accepted for submission i didn't complain about your rewrapping of this book. > If you are unhappy with HTML as an input submission > format then I recommend writing a simple parser for HTML > that changes the HTML choice of tags to the tags you prefer. don't try to make this about _me_. or about the _.html_. this thread started because _you_ came here to complain about the _.txt_ format, claiming that it was substandard. it's not. at least not unless it is deliberately sabotaged... (or, if you prefer, tell me what word you want me to use instead of "deliberately", if that's inaccurate.) > If you wrote such a parser I suspect you could contribute > it to PG where it would represent a positive contribution > to the many volunteers like myself who would prefer to be > submitting in HTML format in the first place.? well, it's easy enough to create a .txt file from an .html file -- you simply copy the text out of the browser's window... you'll need to do clean-up on it, since the browser doesn't copy fully-formatted text to the clipboard in most cases... however, you can minimize the work needed by using safari -- which retains the text-styling info -- or internet explorer. it also helps if you colorize blocks, since that will make it easier to reintroduce the indentation lost on those blocks. but really, you're doing it ass-backwards if you're trying to get a .txt file out of an .html file. 
instead, you should be formatting your .txt file as z.m.l., so you can auto-generate an .html file out of the z.m.l. file -- much less work that way. all the work you spend doing .html is just a waste of energy... furthermore, everyone's .html is different, there's no way that future maintainers of the p.g. library will be able to update it; so they will scrap the .html and use the .txt files as their base. you'd be ahead of the game if you adopted that approach now. > In practice HTML encodes most of what I as a volunteer > would choose to spend my time and energy transcribing well, jim, in your "the wings of the dove" book, there was very little to encode in the first place, so it hardly matters. > but I wish it had a little more power, such as the ability to > unambiguously encode authorfirstname, authorlastname, > chapter divisions, etc. well, i'm not gonna indulge any more o.c.d. on author-names. but as far as "chapter divisions", that's easy to do in .html... indeed, the default understanding of e-book .html markup is that the title and chapter-headers are tagged with "h#"... in z.m.l., the title is assumed to be the file's first paragraph. headers below that are marked with 4 or more empty lines preceding them, and 2 empty lines following them, so that gives you an unambiguous outline of the book's structure... indeed, in "the wings of the dove", there is a 2-level outline, with "book" as the first level and "chapter" as the next level... so you will notice that i indicated that by having 8 empty lines above "book" headers, 5 empty lines above "chapter" headers. > If you think you have something positive to contribute to PG, > please do so.? well, jim, i _do_ think i have "something positive to contribute". and i _am_ contributing it, right now... you're soaking in it... i'm showing _you_ -- and anyone else who wants to read it -- how you could be making yourself much more efficient, _and_ how you could create more beautiful and powerful e-books too. > Abusing me for my choice of which sacrifices I am willing > to make, or not willing to make, does not represent a > contribution to PG, nor does it encourage my continuing > contributions to PG. back off. i'm not "abusing" you at all. i'm pointing out how the choices you've made have resulted in an inferior product. surely you're not going to try and argue that the .txt file that lacks its italics is an acceptable digitization, are you? really? and surely you're not suggesting that i simply _ignore_ that? are you? if you can't take the balmy breeze, get off the patio! > The EPUB was not generated by me and i don't hold you responsible for the .epub. > it does point out the advantages of TEI which allows the > encoding in one document the various ?hints? necessary > for attractive rendering of the one TEI input file into > various output rendering language targets. except that's _not_ a "benefit" of .tei in particular, jim. it's a benefit of _any_ "master" format, including .zml. > I also did not generate the MOBI and i don't hold you responsible for the .mobi. indeed, since mobipocket has never supported the mac, i have absolutely no interest in that format, thank you... > I use MOBI files all the time with my favorite reader machine. good. i'm glad you like it. my z.m.l. workflow calls for output to .html, which can then be converted easily to .mobi, so i have that base covered well enough. > by midnight my eyes get tired and then I tend to like > a larger font and smaller margins. 
Which is why I like > reflow formats and reader machines ? they allow me to > easily ?fix? many of the day-to-day ?poor choices? that > some one else has made which would otherwise get in > the way of MY being able to enjoy the book the way *I* want. yes, that's the good things about reflowable formats, which is why we like reflowable formats best of all... > My machine also works well with PDF files except I can?t > fix issues like when the person or process generating the PDF > uses a ?poor? choice of font, or poor choice of margins that's the problem with a nonreflowable format like .pdf... of course, if you have the _master_ file, such as a .zml file, and you can customize the .pdf to your _own_ preferences, then the .pdf you generate will be _exactly_ to your liking... (of course, that won't help your time-of-day considerations.) > I can sometimes work around these problems by > holding my machine in landscape mode,?and > displaying only half a page of PDF at a time, but > it tends to be awkward and painful to hold the machine > sideways for a length of time, and PDF often doesn?t > like to be read a half a page at a time ? since it is > a page layout language, not a half page layout language. i can't do much for the uncomfortable sideways position... but i can tell you that, if you're generating the .pdf yourself, from a .zml master, then you could make the pagesize _fit_ the landscape display, so you were reading _full_ pages on it, and not _half_ pages. just another benefit of customized .pdf. > Which is why I tend to prefer reflow formats like MOBI or HTML right. > However, at the very least the acidity of Bowerbirds remarks oh please jim. does everyone coddle your precious identity? aren't you used to anyone being frank with you in the slightest? i haven't called you any names, or cast any aspersions on you... even my claim that you had "deliberately sabotaged" the .txt file was something that i myself said was "dramatic", even if accurate, and it was a description of your _behavior_, not your _personality_. > reaffirms my contention that PG needs to allow volunteers like > myself to submit files in the volunteer?s choice of file formats, > NOT Bowerbirds. i'm not trying to tell p.g. what to do. and neither should you, jim... > In which case I could have offered PG my efforts in one file format, > and PG could have chosen to accept or reject that offering.? you can do that now. they will choose not to accept it. live with it. > If PG chose to accept that offering then hopefully neither Bowerbird > nor any other volunteer would abuse me of my efforts > which PG has then already acknowledged.? look, if you omit the italics from a book, i'm gonna call you on it... (and if you choose to see that as "abuse", then that's your problem.) i don't give a flying burrito if p.g. has "acknowledged" your work or not; if you left out the italics from the book, i'll call you on it... > Rather, that volunteer would (hopefully) acknowledge that PG > had already accepted my contribution, and in turn if they felt > they could make further positive contributions to this book, > or any other book, in that file format or in any other file format, > then they would be free to do so. you're registered over at distributed proofreaders, jim, so why don't you see if anyone over there will collaborate with you and do the parts of the job that you don't want to do? that would be far more effective than coming here and asking p.g. 
to accept half of what they want, because you just don't wanna do the other half. > Unfortunately, there is not a universal sense within the PG community > as to what does or does not represent a positive contribution, which > in turn leads to that unhappy state of affairs to which Bowerbird > is only too aptly demonstrating today. what is this "unhappy state of affairs" to which you make reference? i'm unhappy because you left out the italics on a book you digitized, and p.g. accepted it anyway, probably because they didn't notice it, probably because they didn't think anyone would do something so stupid as to put the italics in the .html file and not in the .txt file... > Again I ask consideration that PG seriously consider allowing > volunteers to be able to submit books using only ONE file format you _can_ submit _one_ file format. they'll accept a .txt file alone. no need for the .html file, or for any other format, for that matter. > if they choose to do so, not requiring multiple file formats since > that leads to that unhappy state of affairs that Bowerbird is today > only too well demonstrating. not only are you spouting nonsense about "unhappy state of affairs", you're _repeating_ it, and in the very next paragraph no less! weird! > Better yet, pick YOUR OWN book to transcribe and contribute to PG, > rather than abusing ME of MY efforts on MY choice of books! you can repeat that "abuse" line all you want, jim, but as long as your book is missing those italics, you are the one in the wrong. but hey, no problem, i'm gonna fix your work -- correct your flaws -- and submit a _corrected_ version of your .txt file, with all the italics... but don't expect me to fix _all_ of your books! -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From joshua at hutchinson.net Mon Sep 21 14:40:46 2009 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Mon, 21 Sep 2009 21:40:46 +0000 (GMT) Subject: [gutvol-d] Re: PG French text file #1500 Message-ID: <39275292.87509.1253569246873.JavaMail.mail@webmail11> An HTML attachment was scrubbed... URL: From lee at novomail.net Mon Sep 21 15:15:21 2009 From: lee at novomail.net (Lee Passey) Date: Mon, 21 Sep 2009 16:15:21 -0600 Subject: [gutvol-d] Re: PG French text file #1500 In-Reply-To: <39275292.87509.1253569246873.JavaMail.mail@webmail11> References: <39275292.87509.1253569246873.JavaMail.mail@webmail11> Message-ID: <4AB7FAF9.8050309@novomail.net> Joshua Hutchinson wrote: > Just at a quick glance, it looks like any harvesters would need to track down > original scans to clear it through PG's normal clearing routines. If your interested, you could contact Yann Forget, http://www.forget-me.net/; He's the one that did the original post to wikimedia. As Alexis de Tocqueville died in 1859, I'm guessing the work is out of copyright. From traverso at posso.dm.unipi.it Mon Sep 21 22:29:46 2009 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Tue, 22 Sep 2009 07:29:46 +0200 (CEST) Subject: [gutvol-d] Re: PG French text file #1500 In-Reply-To: <39275292.87509.1253569246873.JavaMail.mail@webmail11> (message from Joshua Hutchinson on Mon, 21 Sep 2009 21:40:46 +0000 (GMT)) References: <39275292.87509.1253569246873.JavaMail.mail@webmail11> Message-ID: <20090922052946.BF28C10138@cardano.dm.unipi.it> Currently Tocqueville is in proof at DP, in 4 parts between P2 and P3. It might be fast-tracked if PG wants it, but dozens of french projects might come before anyway. 
Carlo From hart at pobox.com Tue Sep 22 02:20:15 2009 From: hart at pobox.com (Michael S. Hart) Date: Tue, 22 Sep 2009 02:20:15 -0700 (PDT) Subject: [gutvol-d] Re: PG French text file #1500 In-Reply-To: <20090922052946.BF28C10138@cardano.dm.unipi.it> References: <39275292.87509.1253569246873.JavaMail.mail@webmail11> <20090922052946.BF28C10138@cardano.dm.unipi.it> Message-ID: There are still 14 more French eBooks to go, so I should hope we can get this one done in time to be #1500, please give it a go. Thanks!!! Michael On Tue, 22 Sep 2009, Carlo Traverso wrote: > > Currently Tocqueville is in proof at DP, in 4 parts between P2 and > P3. It might be fast-tracked if PG wants it, but dozens of french > projects might come before anyway. > > Carlo > > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From schultzk at uni-trier.de Wed Sep 23 01:38:59 2009 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Wed, 23 Sep 2009 10:38:59 +0200 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: <4AB79E3C.7000306@telkomsa.net> References: <30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de> <4AB79E3C.7000306@telkomsa.net> Message-ID: <9F4CE293-9C76-4C50-BE5F-E5EA319CD048@uni-trier.de> Hi Jon, Am 21.09.2009 um 17:39 schrieb Jon Richfield: > Hi Keith >> > > I take your point, but I reckon that with a bit of definition of > canonical fields and formats one should be able to clean the lot up > with the exception of cases where previous manual record entry had > violated sensible rules. Most of the problems could be cleaned up > automatically, and only the horrible examples (basically errors) > need get special manual treatment. Trying to construct special rules > for your data base to negotiate, would fall foul of the ingenuity of > fools. > > Whether you really need a "formal data base" or not is an open > question. Some direct access to properly sorted and indexed files > can be startlingly effective. Basically, I was not saying we need a "formal database" or system. The fact is information in the files basically constitute a database, albeit the information is structured. As I mentioned due to restrisction defined for the metadata the desired features are not possible in the present form and could be easily overcome. regards Keith. -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Wed Sep 23 02:11:34 2009 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Wed, 23 Sep 2009 11:11:34 +0200 Subject: [gutvol-d] Re: Once every 7 years a post of monumental stupidity comes along ... In-Reply-To: <4AB7BAA3.5080706@perathoner.de> References: <4AB6F577.2030105@sbcglobal.net> <4AB7BAA3.5080706@perathoner.de> Message-ID: <49E5C390-DDAA-4D2C-93E3-963D7A34FCE0@uni-trier.de> Hi Marcello, NOW who is laughing an has more egg in his face. 1) A very piss poor DB-server that has to reindex for every request 2) An even more piss poor programmer that can NOT do it! ALL your argumernts are MUTE. It is interesting that companies have costumer databases, databases for thier employees changing, data requesting data and all this is based a databases. Furthermore, what kind of database system are you using that has to role in all that information. Last and not least If you could understand have I said you would have understood that the information retrieved beiing presented to the user can be sorted!! 
You have a complete lack of programming and system engineering knowledge. I
figured you would complain about my pseudo code (you do know what that is?).
I simply wrote it down without any correction or checking for typos. I put in
more than I wanted to, so that the simplest minds could understand the basic
simplicity of doing the task. True, I should have used "or" instead of "and".
I also admit that I did forget the length check. But, again, it was just to
show how easily it can be done. And if you consider that I just wrote this
down without any afterthought or second thought, it took me less than a
minute, leaving me four minutes to clear up the rest. I could have been
abstract about it and left the code out. Besides, I would for myself make it
far more elaborate, to account for encodings and different languages.

Am 21.09.2009 um 19:40 schrieb Marcello Perathoner:

> ... which isn't even funny. Here it is:
>
> Keith J. Schultz wrote:
>
>>> The change I would like is to have spaces taken into account in
>>> the name sort.
>>>
>>> So we would have something like this:
>>>
>>> Green, Alice
>>> Green, Robert
>>> Greenacre, Janet
>>> Greenjeans, Mr.
>>>
>>> instead of like this:
>>>
>>> Greenacre, Janet
>>> Green, Alice
>>> Greenjeans, Mr.
>>> Green, Robert
>
>
>> Duhhhh !! If this is true there are some people
>> that ougth to take a course in 101 programming or db
>> design. It takes about 5 minutes to write the code.
>
>
> And it took the writer of that post no longer than that to ruin his
> reputation forever.
>
> Bowerbird, meet Keith, Keith, meet Bowerbird.
>
>
> Obviously the writer's ignorance about modern web serving
> infrastructure is complete. Even a single afternoon class about
> database programming would have taught him enough to keep his mouth
> shut. The writer of this nonsense obviously does not know that:
>
> - To sort a dataset locally on a web server, like the writer
> proposes, you have to request the whole dataset from the database
> server. This induces a considerable load on the database server and
> on the wire.
>
> - Sorting on the web server is much slower than sorting on the
> database server because the database server uses precomputed tables
> (indexes) which are already sorted, but the web server needs to sort
> from scratch.
>
> So instead of asking the database server to:
>
> give me 100 authors sorted by name starting at offset 4500
>
> which the server could almost instantly satisfy out of the pre-
> sorted index tables, you have to ask the server to
>
> give me all authors
>
> which are 12800 at present.
>
>
> Instead of reading 100 rows from the disk and passing them over the
> wire to the web server, you'll end up reading from the disk 12800
> rows and transmitting them. Already a factor of 128 times slower.
>
> Then comes the gratuitous sorting of 12800 rows on the web server.
> After which sort we throw away 12700 rows and present the user with
> the 100 rows she requested.
>
>
> But the ignorance of the writer is not only colossal regarding
> present day database systems, it becomes even more surrealistic when
> the writer tries to apply himself to programming.
>
> The writer wastes 30 lines of code to re-implement a function that
> every programming language carries out-of-the-box. That alone would
> have sufficed to demonstrate that the writer's notions about
> programming are extremely vague at best.
> > We will furthermore see that the writer used pseudo-code not only to > hide his ignorance of any actual programming language, but also to > avoid having to test his absurd concoction, which test would have > immediately revealed its uttermost bullshittiness even to himself. > > The absurd proposal of the writer runs thus (feel free to skip to > the beef, the irksomeness of this code is just good enough for a > smile): > > >> IsEntrySmallertThan(X, Y) :- >> Pos := 0; >> If (Length(X) < Length (Y)) >> then MaxPos = Length(X) -1; >> else MaxPos = Length(Y) - 1; >> end if >> While ((IsSmaller := CharSmaller(X[Pos], Y[Pos]) == 0) > > and Pos != MaxPos ) >> Pos := Pos +1; >> end While > > > > return IsSmaller; > > end EntrySmallerThan > > > > CharAtSmaller(X, Y) :- > > If (Cardinal(X) < Cardinal(Y) ) > > return 1 > > else > > If Cardinal(X) > Cardinal(Y) > > then return -1; > > else return 0; > > end if > > end if > > end CahrAtSmaller > > > > Cardinal(X) :- > > If (X in set of standard Chars) > > then return X > > else return -1; > > end if > > end Cardinal > > > > Put this ipseudo code what language you want and voila. > > Cardinal can be made as > > complex as you want if you needed finer distinctions. > > > > regards > > Keith. > > > For the sake of playing let us call: > > IsEntrySmallertThan ('a', 'ab'). > > MaxPos would then be set to 0. > > The While loop will call CharSmaller, which does not exist, because > the function is called CharAtSmaller. First Bug. > > CharAtSmaller would then return 0 because it compares 'a' to 'a', > which two are equal. The iteration will then stop because Pos == > MaxPos == 0. The function will then return. > > Conclusion: > > IsEntrySmallertThan ('a', 'ab') returns 0 > > According to this guy's wisdom, 'a' is not `smallert? than 'ab'. > > QED > > Moreover: this code would dump core on you the moment you call it > with an empty string, Cardinal (X) returns X or -1 so you'll end up > comparing characters with -1, which will not work on machines with > unsigned characters ... and so on. > > > Throwing even one line of code over the wall without testing it, is > the hallmark of the utter clueless beginner. Even people less full > of themselves fall for it sometimes. > > > > > -- > Marcello Perathoner > webmaster at gutenberg.org > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d From Bowerbird at aol.com Wed Sep 23 02:46:29 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Sep 2009 05:46:29 EDT Subject: [gutvol-d] tuesday, september 22, the first day of fall Message-ID: sometimes you have to timestamp a development... or two... or three... *** new reader-machines are being announced _daily_, it seems. today's model is a joint venture between best buy and verizon. yes, i know, i know. how are best buy and verizon gonna wrangle book-buyers away from amazon dot com, the web super-retailer, and the bookstore named after the world-class river? i have no idea. and you can bet they have no idea either. look for two dozen reader-machines to debut in 2009. look for one dozen of them to be dead by the year-end. the big winner, certainly, will be adobe and its d.r.m. the big loser, ironically, will be adobe and its d.r.m., because you know adobe ain't gonna be able to pull off a d.r.m. scheme with dozens of no-experience partners, and the resultant fiasco will be endlessly entertaining... 
plus it's bound to piss off all kinds of paying customers, and the righteous indignation promises to be amusing... the kindle, of course, won't be impacted in the slightest by all this downmarket warfare amongst the ranks, but second-runner sony might get hit by some of the gunfire. *** speaking of sony... saw it the first time on saturday night. an advertisement for the sony reader, on broadcast television in prime time. with justin tumberlank _and_ payton manning. not to mention the world's fastest speedreader. saw it again the next time on sunday night. on network television during the emmy awards. with justin tomberland _and_ payton manning. not to mention the world's fastest speedreader. saw it again the next time on monday night. during some very big season premiere shows. with justin timberlake _and_ payton manning. not to mention the world's fastest speedreader. these are some serious media buys at high prices, folks, and a sign that sony isn't just fooling around. they're fooling around _and_ blowing an ad budget. but at least we get a sense they think it's important. ads note the (introductory model) price of $199, as well as product highlight that the machine can store hundreds of books. who woulda thunk that these machines woulda hit the television waves? *** in other news, a guy name "brown" sold some books. *** happy autumnal equinox, folks... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From joshua at hutchinson.net Wed Sep 23 05:37:22 2009 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Wed, 23 Sep 2009 12:37:22 +0000 (GMT) Subject: [gutvol-d] Re: Once every 7 years a post of monumental stupidity comes along ... Message-ID: <1895795374.134329.1253709442237.JavaMail.mail@webmail07> An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Wed Sep 23 10:17:09 2009 From: prosfilaes at gmail.com (David Starner) Date: Wed, 23 Sep 2009 13:17:09 -0400 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: References: Message-ID: <6d99d1fd0909231017v24382285iaa082094c1624821@mail.gmail.com> On Mon, Sep 21, 2009 at 1:13 PM, James Adcock wrote: > Why does this matter?? Consider the famous author name Sun Tzu Let's consider it; why do you think the general audience will search for Sun Tzu and not Tzu, Sun? A system that just gives an unsearchable list of names and doesn't have Tzu, Sun, even if only as an alias, is unusable, correct or not. Not to mention that his name is S?n Z?, or ??, or ?? or Sunzi, and that doesn't even start to approach the problem of spelling questions. -- Kie ekzistas vivo, ekzistas espero. From jimad at msn.com Wed Sep 23 11:30:19 2009 From: jimad at msn.com (Jim Adcock) Date: Wed, 23 Sep 2009 11:30:19 -0700 Subject: [gutvol-d] Re: the wings of the dove -- 002 In-Reply-To: References: Message-ID: Thank you Bowerbird, again, for making my points for me: 1) If I had submitted this book instead to DP there would have been a much larger number of punc errors introduced as "required" by the DP process. 2) We would all still be waiting for this book, because I prior submitted two books to DP after a considerable amount of work on my part and they have still to see the light of day. Someone with a practical knowledge of queuing theory needs to go over these issues with DP. 
3) I know perfectly well that errors remain unseen, which is why I would
like an input file format that easily allows another motivated volunteer to
pick up where I left off when my children start complaining that they are
unfed and unclothed and "reality calls" -- besides which, by the time I am
"done" with a book like "Dove" I am spitting blood and ready to do
something else for a while -- rather than listening to Bowerbird insult my
efforts and insult my integrity simply because I do not support his favored
hack markup schemes -- which no one else wants to support either.

From prosfilaes at gmail.com Wed Sep 23 12:17:03 2009
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 23 Sep 2009 15:17:03 -0400
Subject: [gutvol-d] Re: the wings of the dove -- 002
In-Reply-To: 
References: 
Message-ID: <6d99d1fd0909231217s30902b15jc589f865ce184bc1@mail.gmail.com>

On Wed, Sep 23, 2009 at 2:30 PM, Jim Adcock wrote:
> Thank you Bowerbird, again, for making my points for me:

No, he's not, because nobody is listening to him. I'm considering
killfiling you, too, because I no more want to hear from Bowerbird by
proxy than directly. Stop complaining about what he does; he's a
troll, he enjoys it. Just stop reading his messages.

--
Kie ekzistas vivo, ekzistas espero.

From jimad at msn.com Wed Sep 23 12:56:07 2009
From: jimad at msn.com (James Adcock)
Date: Wed, 23 Sep 2009 12:56:07 -0700
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: 
References: 
Message-ID: 

>so if you want to find "sun tzu", you'd search for "sun tzu", and if that
didn't work, then you'd search for "sun" and "tzu" separately...

Okay, let's get specific. My favorite machine lists these things
alphabetically by authorlastname, authorfirstname, and I currently have
about 100 books on my favorite machine, whereas on the previous-generation
machine, which I use less nowadays, I have about 500 books. So I get to
scroll through the list of books three times to perform your "search
algorithm" example.

But, more importantly, a reader who picks up e-books from PG and from
other publishing houses -- say someone who wants to collect and read
everything ever written by Sir Arthur Conan Doyle -- finds that his or her
e-book library, instead of being correctly sorted and cataloged by author,
now has Sir Arthur Conan Doyle spread out at about five factorial
locations on the e-book bookshelf. Or more likely, Sherlock ends up all in
one place if it comes from one of a variety of professional publishing
houses, and at another location if the e-book is coming from PG. Or god
knows where if purchased from Amazon, "published" there by one of an
infinite number of bottom-feeding garage shops.

And why am I "o.c.d." on these issues? Because I have converted a few tens
of thousands of PG books to e-book format and have found, *in practice*
rather than in theory, that the issue of author names and how to
"correctly" extract them from the data PG provides -- or doesn't provide --
ends up being one of the real stumbling blocks. Certainly an extensible
format like TEI, if it contained correctly coded authorlastname,
authorfirstname information, would make extraction of correct "spine"
information trivial. Then the problem reduces to how, in the PG system, to
get a "correct" canonical form of authorlastname, authorfirstname, and the
answer is that some real human being has to do that research -- which is
perhaps most appropriately done as part of the copyright clearance
process, which I think frequently refers to LoC in the first place?
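(The "spine name" problem is easy to state and hard to finish, which is the
point being made above. Below is a naive sketch, assuming nothing about PG's
actual tools: it flips "Firstname Lastname" into "Lastname, Firstname" and
knows a short, invented list of surname particles. The last example shows
where any such heuristic goes wrong without a human doing the research.)

    PARTICLES = {"van", "von", "de", "der", "du", "la", "le"}

    def spine_name(display_name):
        # Guess a "Lastname, Firstname" sort form from a display name.
        parts = display_name.split()
        if len(parts) < 2 or "," in display_name:
            return display_name        # mononym, or already in sorted form
        split_at = len(parts) - 1
        # fold leading particles ("van", "de", ...) into the surname
        while split_at > 1 and parts[split_at - 1].lower() in PARTICLES:
            split_at -= 1
        return " ".join(parts[split_at:]) + ", " + " ".join(parts[:split_at])

    print(spine_name("Henry James"))       # James, Henry
    print(spine_name("Walter van Holst"))  # van Holst, Walter
    print(spine_name("Sun Tzu"))           # Tzu, Sun -- wrong: Sun is the family name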
Or as another simple example of these issues, based on an author I have
recently worked on: enter "James Henry" in the PG home page author slot,
and compare what you get to when you enter "Henry James", and then try
"Henry, James", and then try "James, Henry" -- and then rationalize to the
readers of this list your results and why those results are the "correct"
result ???
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From jimad at msn.com Wed Sep 23 13:24:20 2009
From: jimad at msn.com (James Adcock)
Date: Wed, 23 Sep 2009 13:24:20 -0700
Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more)
In-Reply-To: <4AB7D042.5010407@novomail.net>
References: <4AB7D042.5010407@novomail.net>
Message-ID: 

>> What I don't understand is why PG continues to be wedded to plain-text as an
>> *input* encoding format demanded of people submitting texts to PG.
>> Plain-text is too constrained to do the job well.
>
>I find that you are generally correct in everything you have said to
>date. But the reality is that PG does continue to be wedded to
>plain (impoverished) text.

I have heard a reasonable rationale (whether one agrees with it or not) for
why PG remains wedded to PG TXT format as an OUTPUT file format. I have not
heard a reasonable rationale for why PG REQUIRES me to submit BOTH an HTML
AND a PG TXT file when what I as a volunteer really want to submit is just
an HTML file. If I were allowed to submit just an HTML file, then I could
reasonably encode MOST of what I as a transcriber would like to transcribe,
and I could avoid the abuse that I currently receive from Bowerbird when I
don't put in the extraneous marks and spaces and smiley faces not found in
the author's work but which Bowerbird would like to see in the PG TXT in
order to support his pet theories about how the input file format and the
rendered file format need to be one and the same thing. In turn Bowerbird
could use his time and energies in a positive manner, transcribing my HTML
input file into any particular flavor of PG TXT output format that
Bowerbird likes -- and can in turn pat himself on the back for -- rather
than abusing me over efforts that I didn't want to have to do in the first
place.

>For PG to adopt such a scheme, however, would require that PG adopt a set of Standards...

How about a VOLUNTARY set of "suggested" standards for HTML, such that when
a volunteer voluntarily codes to those HTML standards, the results can be
translated and displayed successfully on a larger class of machines?
Certainly PG in practice already enforces a number of standards on
submitted input files -- standards which, if you don't follow them, mean
your files don't get accepted -- even though those standards aren't really
written down, so one ends up having to rework one's submissions not
infrequently in order to get them accepted -- surprise!

>I have concluded that Project Gutenberg is impervious to improvement.

I don't think it's impervious to improvement; it's just that changes are
very slow to come and very hard won. Certainly from my point of view the
recent decision to support, or at least partially support, EPUB and MOBI
has made my life much more enjoyable.

>I would suggest, rather, perfecting your HTML file, uploading it to the Internet Archive (http://www.archive.org/create/) and then posting a message here indicating where it can be found if any other volunteer wants to create a degraded version of your master copy.

Sigh -- I would hate to think that I have to "route around damage" -- again.
From prosfilaes at gmail.com Wed Sep 23 13:34:03 2009 From: prosfilaes at gmail.com (David Starner) Date: Wed, 23 Sep 2009 16:34:03 -0400 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: References: <4AB7D042.5010407@novomail.net> Message-ID: <6d99d1fd0909231334j49f3f342u8052b9c48f629097@mail.gmail.com> On Wed, Sep 23, 2009 at 4:24 PM, James Adcock wrote: > I could avoid > the abuse that I currently receive from Bowerbird That's like an abused woman saying that if she just had a better dishwasher her husband would stop hitting her. Bowerbird will abuse you no matter what. At some point, it's your fault for not putting him in a killfile. -- Kie ekzistas vivo, ekzistas espero. From gbnewby at pglaf.org Wed Sep 23 13:35:40 2009 From: gbnewby at pglaf.org (Greg Newby) Date: Wed, 23 Sep 2009 13:35:40 -0700 Subject: [gutvol-d] Re: PG French text file #1500 In-Reply-To: References: <39275292.87509.1253569246873.JavaMail.mail@webmail11> <20090922052946.BF28C10138@cardano.dm.unipi.it> Message-ID: <20090923203540.GA8486@pglaf.org> On Tue, Sep 22, 2009 at 02:20:15AM -0700, Michael S. Hart wrote: > > > There are still 14 more French eBooks to go, > so I should hope we can get this one done in > time to be #1500, please give it a go. Carlo: If this could be fast-tracked, it would be great. We would love to plan on Toqueville as French #1500. Can you let Michael and I know if this seems likely, so we can plan accordingly? Thanks much. -- Greg > > Thanks!!! > > > Michael > > > > > On Tue, 22 Sep 2009, Carlo Traverso wrote: > > > > > Currently Tocqueville is in proof at DP, in 4 parts between P2 and > > P3. It might be fast-tracked if PG wants it, but dozens of french > > projects might come before anyway. > > > > Carlo From ajhaines at shaw.ca Wed Sep 23 13:35:58 2009 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Wed, 23 Sep 2009 13:35:58 -0700 Subject: [gutvol-d] Re: the wings of the dove -- 002 References: Message-ID: Jim, I would suggest that if you're spitting blood by the time you finish a book that you're going at it too fast/forcefully. Slow down--there are no deadlines at PG. Re bowerbird - ignore him/it. Few, if any, aspects of PG (or DP) satisfy him/it, while little, if anything, that him/it does, satisfies anyone else. Al ----- Original Message ----- From: "Jim Adcock" To: "'Project Gutenberg Volunteer Discussion'" Sent: Wednesday, September 23, 2009 11:30 AM Subject: [gutvol-d] Re: the wings of the dove -- 002 > Thank you Bowerbird, again, for making my points for me: > > 1) If I had submitted this book instead to DP there would have been a much > larger number of punc errors introduced as "required" by the DP process. > > 2) We would all still be waiting for this book, because I prior submitted > two books to DP after a considerable amount of work on my part and they > have > still to see the light of day. Someone with a practical knowledge of > queuing > theory needs to go over these issues with DP. 
> > 3) I know perfectly well that errors remain unseen, which is why I would > like an input file format that easily allows another motivated volunteer > to > pick up where I left off when my children start complaining that they are > unfed and unclothed and "reality calls" -- besides which by the time I am > "done" with a book like "Dove" I am splitting blood and ready to do > something else for a while -- rather than listening to Bowerbird insult my > efforts and insult my integrity simply because I do not support his > favored > hack markup schemes -- which no one else wants to support either. > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From jimad at msn.com Wed Sep 23 13:42:14 2009 From: jimad at msn.com (James Adcock) Date: Wed, 23 Sep 2009 13:42:14 -0700 Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer In-Reply-To: <6d99d1fd0909211227q765dd187n77dc76cbf85d9dd6@mail.gmail.com> References: <6d99d1fd0909211227q765dd187n77dc76cbf85d9dd6@mail.gmail.com> Message-ID: >Whatever the base file formats are, Project Gutenberg, like most archives, needs to pick one or a small set of them so that the people who use Project Gutenberg can know what they need to read the files. Again, this is confusing input file formats with output file formats. PG could choose to allow HTML as an acceptable input file format because PG can easily write a tool to convert HTML to their choice of PG TXT file format, including standardizing on such issues as whether italics ought to be rendered in PG TXT files as *star* or +plus+ or _underscore_ or SHOUT or better yet maybe PG could allow these kinds of choices to be made by an output filter so that text readers for the blind could have something more compatible with their prosodic emphasis machines, or better yet maybe the output filters could actually implement some of the "proper" prosodic emphasis markings for the more popular blind reader machines in order to maximize their capabilities. In my experience what happens is just the opposite of what you might expect -- rather the first time user of PG picks up a PG TXT file because they think that represents the "lowest common denominator" for their machine and so they think "it must surely work" and what they find instead is that what gets displayed on their machine is a total hash of line breaks in non-sensible locations, and random garbage marks, and then they conclude PG is archaic brain dead stuff by people who are clueless and they give up and go away. Or alternatively they post stupid stuff on public forums like "gee I like all these free books from PG and I read them all the time even though they have these random line-breaks stuck in all over the place" -- which in turn makes the efforts of the PG volunteers look like clueless idiots. There are other sites which take PG texts and do intelligent things like "tell me what kind of machine you are reading on and I will suggest which of the many file formats will probably display to your liking on your machine" which I think in practice tends to result in happier customers. Right now PG is still basically assuming that the average PG "customer" is a die-hard hacker running some flavor of a *nix machine in a college environment. Which is probably [somewhat] true of the people submitting books, but not at all true of the people who would just like to read them. 
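(The "output filter" described in the message above -- HTML in, PG-style
text out, with the italics convention chosen at conversion time instead of
being frozen at submission time -- might look roughly like the following
sketch. Only <i> and <em> are handled, and everything else about PG TXT
layout, wrapping included, is ignored.)

    from html.parser import HTMLParser

    class TextFilter(HTMLParser):
        """Flatten HTML to text, marking italics with a caller-chosen string."""

        def __init__(self, italic_mark="_"):
            super().__init__()
            self.italic_mark = italic_mark
            self.pieces = []

        def handle_starttag(self, tag, attrs):
            if tag in ("i", "em"):
                self.pieces.append(self.italic_mark)

        def handle_endtag(self, tag):
            if tag in ("i", "em"):
                self.pieces.append(self.italic_mark)

        def handle_data(self, data):
            self.pieces.append(data)

        def text(self):
            return "".join(self.pieces)

    f = TextFilter(italic_mark="*")
    f.feed("She read <i>The Wings of the Dove</i> twice.")
    print(f.text())   # She read *The Wings of the Dove* twice.

(Running the same extraction with italic_mark="_" or "+", or adding an
upper-casing step for a SHOUT style, would serve readers with different
preferences, which is the argument for doing it in an output filter rather
than in the submitted file.)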
From jimad at msn.com Wed Sep 23 15:17:23 2009 From: jimad at msn.com (Jim Adcock) Date: Wed, 23 Sep 2009 15:17:23 -0700 Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer In-Reply-To: References: Message-ID: >how come the .txt file is missing the italics which are right there, big as day, in the .html version of the file? Presumably I followed the instructions on the PG website by: 1) typing in "italic" in the big red "search site term" box Then, 2) typing in "HTML" in the big red "search site term" box, finding the "HTML FAQ" and following the suggestions there in H.12 From jimad at msn.com Wed Sep 23 18:12:52 2009 From: jimad at msn.com (Jim Adcock) Date: Wed, 23 Sep 2009 18:12:52 -0700 Subject: [gutvol-d] Re: Name lists and Big-endianism In-Reply-To: <6d99d1fd0909231017v24382285iaa082094c1624821@mail.gmail.com> References: <6d99d1fd0909231017v24382285iaa082094c1624821@mail.gmail.com> Message-ID: >Let's consider it; why do you think the general audience will search for Sun Tzu and not Tzu, Sun? A system that just gives an unsearchable list of names and doesn't have Tzu, Sun, even if only as an alias, is unusable, correct or not. I assure you that I and about a million other people have for our primary reading machines a machine which only provides a library of books sorted and listed by authorlastname and which does not in fact have a "search" capability on authornamepart and while I agree with you that I would prefer a machine with a stronger search capability the reason that we put up with this machine is that it is so many light years ahead of other machines that we might want to read on as to make that decision a "no brainer" -- even given the shortcomings of the user shell design. In fact after "putting up with" computers and having to print out documents for the last 35 years of my life I now find that I almost never print out anything, and I almost never buy a book or magazine in print anymore. And the machine goes with me everywhere and I read it every night in bed until I fall asleep. So this is by far my most useful most favorite machine I have ever had in my life. But then again, I do a LOT of reading! A better counter question is why would PG WANT to implement a system that prevents easy and correct implementation of common e-book formats? -- EPUB and MOBI ? From jimad at msn.com Wed Sep 23 18:41:34 2009 From: jimad at msn.com (Jim Adcock) Date: Wed, 23 Sep 2009 18:41:34 -0700 Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) In-Reply-To: <6d99d1fd0909231334j49f3f342u8052b9c48f629097@mail.gmail.com> References: <4AB7D042.5010407@novomail.net> <6d99d1fd0909231334j49f3f342u8052b9c48f629097@mail.gmail.com> Message-ID: LOL ok you win your point! I will attempt to filter him out. From Bowerbird at aol.com Wed Sep 23 21:34:37 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 24 Sep 2009 00:34:37 EDT Subject: [gutvol-d] Re: the wings of the dove -- 002 Message-ID: al said: > Re bowerbird - ignore him/it.? i'm male, al. if you had been at the december 2003 meet-up, where we celebrated the first 10,000 p.g. e-texts, you would have met me, so you woulda known then. but i'm sure you probably knew that anyway, and were just using the "it" form as a mechanism of dehumanization. > Few, if any, aspects of PG (or DP) satisfy him/it, don't be ridiculous. i love michael hart, the soul of p.g., the man who birthed the project, nurtured it to adulthood. 
the man's a saint, and he's smart too, with his solid focus on the text format as the backbone, to ensure a lifelong viability. i love all the volunteers, who have generously donated so much of their time and energy and money to digitize these old books, and persisted in spite of the shitty schooling and tools they had, working inside workflows that wasted their time on terrible design. i love the world, who embraced the project gutenberg library early, making it the premiere cyberlibrary, beating back half-assed efforts by other people who were enamored of some gimmick or another, whether in the form of proprietary formats or open-source snake-oil. i love the faq, which lay down some good advice on the .txt format, even if the whitewashers don't do any checks to ensure compliance. i love david widger for all the hard work he's done over the years... i love greg newby for offering webspace to anyone who needs it... i love the d.p. people who give me support behind the enemy lines. i love the d.p. people (lucy!) who support me in _front_ of those lines. i love the d.p. people who keep working on improving the .html format. i love thundergnat, for giving the d.p. people a tool they can use... i love all the other programmers who said "phuck you" to d.p. because programmers shouldn't hang out in a place where we ain't appreciated. i love the guy who programmed "eucalyptus" and thereby proved that you can create _beautiful_ e-books on the iphone from p.g. e-texts. i love the guy who programmed "eucalyptus" because he left me room to prove that you can make those iphone e-books _powerful_ as well... i'm sure there's more, but that's probably enough off the noggin... > while little, if anything, that him/it does, satisfies anyone else. "that him does" -- didn't think that one through, did you al? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Sep 23 21:37:15 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 24 Sep 2009 00:37:15 EDT Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more) Message-ID: jim said: > I have not heard a reasonable rational why PG REQUIRES me > to submit BOTH an HTML AND a PG TXT file if what I as a volunteer > really want to submit is just an HTML file. > If I were allowed to just submit an HTML file then I could reasonably > encode MOST of what I as a transcriber would like to transcribe, > and I could avoid the abuse that I currently receive from Bowerbird > when I don't put in the extraneous marks and spaces and smiley faces > not found in the author's work but which Bowerbird would like to see > in the PG TXT in order to support his pet theories about how > the input file format and the rendered file format need to be one > and the same thing. In turn Bowerbird could use his time and > energies in a positive manner transcribing my HTML input format file > into any particular flavor of PG TXT output file format that > Bowerbird likes and can and will in turn pat himself on the back for, > rather than abusing me of efforts that I didn't want to have to do > in the first place. i won't let you bait me into any more of this nonsense, jim. everyone paying attention -- and probably even most of those _not_ paying attention -- knows that i refuted your points deftly, and completely, except for those which i myself have already made. 
(but, um, gee, thanks for all your _support_ on those matters, jim; having you agreeing with them really bolstered up their credibility.) you come here looking for a master format. i handed you one. but because it doesn't look the way you thought it _would_ look, you don't recognize that it's exactly what you were looking for... there's a certain bit of humorous irony in all that... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Sep 23 22:15:25 2009 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 24 Sep 2009 01:15:25 EDT Subject: [gutvol-d] let's wind this down Message-ID: jim said: > 1) If I had submitted this book instead to DP there > would have been a much larger number of punc > errors introduced as "required" by the DP process. what? the d.p. process requires the introduction of errors? > 2) We would all still be waiting for this book, > because I prior submitted two books to DP after > a considerable amount of work on my part and > they have still to see the light of day. Someone > with a practical knowledge of queuing theory > needs to go over these issues with DP. you might want to discuss queuing theory in the d.p. forums. i suppose they would get a lot of good out of that discussion. > 3) I know perfectly well that errors remain unseen, > which is why I would like an input file format that > easily allows another motivated volunteer to > pick up where I left off when my children start complaining > that they are unfed and unclothed and "reality calls" i suppose i've already told you that z.m.l. does just that. i even mounted your very book, so that you could see it. so i don't suppose it'd do any good to repeat it again now. > rather than listening to Bowerbird insult my efforts i pointed out that your .txt version was missing the italics. if you consider that to be an "insult", your skin is too thin... > and insult my integrity simply because > I do not support his favored hack markup schemes no, you failed to support the project gutenberg standard, which calls for italics to be marked in the .txt format... > Presumably I followed the instructions on the PG website no, you most certainly failed to follow those instructions... > http://www.gutenberg.org/wiki/Gutenberg:Volunteers'_FAQ#V.94._What_sh ould_I_do_with_italics.3F you will see that it says: > Underscores are now the effective standard for italics in PG texts. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From traverso at posso.dm.unipi.it Wed Sep 23 22:38:57 2009 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Thu, 24 Sep 2009 07:38:57 +0200 (CEST) Subject: [gutvol-d] Re: PG French text file #1500 In-Reply-To: <20090923203540.GA8486@pglaf.org> (message from Greg Newby on Wed, 23 Sep 2009 13:35:40 -0700) References: <39275292.87509.1253569246873.JavaMail.mail@webmail11> <20090922052946.BF28C10138@cardano.dm.unipi.it> <20090923203540.GA8486@pglaf.org> Message-ID: <20090924053857.59CB010138@cardano.dm.unipi.it> Yes, we plan to have the first volume complete in approximately one week from now. The complete edition is in 4 volumes, the other three volumes will appear subsequently, I hope in a few months (they have passed P1, need P2 and F1, and will skip P3 and F2 that will be done offline comparing with the wikimedia edition). 
I have cross-checked a part after P2 with the wikimedia edition, and the comparison runs smoothly, identifying a small set of remaining transcription errors evenly split between the two. The result might be error-free.

The book is in the Pagnerre 12th edition, 1848, and the first volume is just a revision of the first 1835 edition (I take the information from http://www.loa.org/volume.jsp?RequestID=202&section=notes ) so it is complete in itself. If possible, it would be nice to reserve 4 slots that would fit the complete set.

The last edition published by Tocqueville is the 13th, which has an additional appendix. The 12th and 13th editions are regarded as the definitive editions. The 13th is available at the Internet Archive; I will see what is reasonable to do to make the PG edition the authoritative online edition. Maybe a transcriber's note at the end, with an analysis of the differences, and the additional material of the 13th.

Carlo

>>>>> "Greg" == Greg Newby writes:

 Greg> On Tue, Sep 22, 2009 at 02:20:15AM -0700, Michael S. Hart
 Greg> wrote:
 >> There are still 14 more French eBooks to go, so I should hope
 >> we can get this one done in time to be #1500, please give it a
 >> go.

 Greg> Carlo: If this could be fast-tracked, it would be great. We
 Greg> would love to plan on Toqueville as French #1500.

 Greg> Can you let Michael and I know if this seems likely, so we
 Greg> can plan accordingly? Thanks much. -- Greg

 >> Thanks!!!
 >>
 >> Michael
 >>
 >> On Tue, 22 Sep 2009, Carlo Traverso wrote:
 >>
 >> > Currently Tocqueville is in proof at DP, in 4 parts between
 >> > P2 and P3. It might be fast-tracked if PG wants it, but
 >> > dozens of french projects might come before anyway.
 >> >
 >> > Carlo

From schultzk at uni-trier.de Wed Sep 23 23:43:17 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 24 Sep 2009 08:43:17 +0200
Subject: [gutvol-d] Re: Once every 7 years a post of monumental stupidity comes along ...
In-Reply-To: <1895795374.134329.1253709442237.JavaMail.mail@webmail07>
References: <1895795374.134329.1253709442237.JavaMail.mail@webmail07>
Message-ID:

Hi There,

On 23.09.2009 at 14:37, Joshua Hutchinson wrote:

> Keith,
>
> While I agree that Marcello's diplomacy is terrible (always has
> been, doubt that's going to change! :) ... he's right and you're
> wrong.
>
> He never claimed the DB has to reindex and he presented very real
> reasons why your solution is terrible from an efficiency point of
> view.
>
> Biggest problem (summary): Your solution does the work on the web
> server, his solution does it on the DB server.
>
> Josh
>

In my original post I NEVER said where this code could be used, whether on the web server or the DB server. Furthermore, I mentioned that the standard sort routines used in a DB server can be overridden and the proposed code can be used. So, the question of efficiency is moot. My solution will work anywhere you want it to. Another reason the so-called efficiency argument is moot is that the web server is calling the db server, which is actually doing all the work.

As for Marcello's attitude, I personally couldn't care less. All I wanted to do was help, and I pointed to the simple fact that the sort routine for the data is easy enough to implement. It is not always good enough to use just built-ins, which I assume is the case here.

The fact remains that the proper sorting can be easily achieved anywhere in the system without any overhead. Position it where you want. I have programmed just such a situation and had no overhead;
the database that was accessed via a web server was set up so that no new sorting or indexing was required when the db was called. I do know what I am doing and what can be done.

From marcello at perathoner.de Thu Sep 24 04:00:06 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu, 24 Sep 2009 13:00:06 +0200
Subject: [gutvol-d] Re: Once every 7 years a post of monumental stupidity comes along ...
In-Reply-To:
References: <1895795374.134329.1253709442237.JavaMail.mail@webmail07>
Message-ID: <4ABB5136.5020001@perathoner.de>

Keith J. Schultz wrote:

> In my original post I NEVER said where this code could be used, whether
> on the web server or the DB server.

If you intended your code to be run on the database server, it would be an even more incredibly stupid thing to do.

If you want to influence the database server, then you must not write a routine that *compares* strings, but a routine that *transforms* strings:

If you wanted to sort like German phonebooks do, that is: to sort 'ö' as 'oe', then you would write a routine that substitutes all 'ö's in your input with 'oe's. Then you would feed the transformed string to the index table while feeding the original string to the data table. Voilà.

Of course you would have to transform all search terms in the same fashion too, because databases do most of the work on index tables and reach out to the data tables only when they really really really need to.

> Furthermore, I mentioned that the standard sort routines used in a DB
> server can be overridden and the proposed code can be used.

How would you know? You don't even know which database we are using.

> My solution will work anywhere you want it to.

Your `solution' didn't even work on paper. I found 3 fat bugs just on a first eyeball review.

> I have programmed just such a situation and had no overhead; the
> database that was accessed via a web server was set up so that no new
> sorting or indexing was required when the db was called.

You programmed a small in-house application that gets hit a dozen times a day. In your situation you can get away with any amount of programming sloppiness because the hardware is so much superior to the task.

I am running a site that gets more than 1 megahit a day, serving more than 70,000 customers per day.

Failure to consider scalability issues is another telltale sign of the rookie.

-- 
Marcello Perathoner
webmaster at gutenberg.org

From schultzk at uni-trier.de Fri Sep 25 04:27:00 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Fri, 25 Sep 2009 13:27:00 +0200
Subject: [gutvol-d] Re: Once every 7 years a post of monumental stupidity comes along ...
In-Reply-To: <4ABB5136.5020001@perathoner.de>
References: <1895795374.134329.1253709442237.JavaMail.mail@webmail07> <4ABB5136.5020001@perathoner.de>
Message-ID: <9968FCBD-40FF-4E4A-BD37-E4A65C6D4750@uni-trier.de>

Hi Marcello,

As you evidently do not know even the slightest about database systems, I will stop responding to your arguments. I will repeat: I was using pseudo-code, so you can use whatever is appropriate for the task. The basic algorithm will work with any kind of data or structures, for that matter. All that is needed is an appropriate cardinal function. You have failed to realize this.

Also, I do not know what kind of server system you are using, but I have known systems that can handle 'ö' for decades.

Like I said, you are evidently not qualified to partake in this discussion.

regards
Keith.
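[For illustration: a minimal Python sketch of the transform-rather-than-compare approach described in Marcello's message above -- fold the string once, store the folded key, and run search terms through the same function. The folding table and the sample names are assumptions made for the example; this is not the actual gutenberg.org code.]

    # Sketch: build a sortable key by *transforming* the string once,
    # instead of overriding the comparison routine at query time.
    # German-phonebook folding (oe for ö, etc.); the table is illustrative.
    FOLD = str.maketrans({
        "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
        "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    })

    def sort_key(name):
        """The transformed string that goes into the index column;
        incoming search terms get run through the same function."""
        return name.translate(FOLD).lower()

    authors = ["Böll, Heinrich", "Boell, Heinrich", "Borrow, George"]
    print(sorted(authors, key=sort_key))
    # Both spellings of Böll/Boell now collate together, and a search
    # term folded with sort_key() matches the stored index value.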
From schultzk at uni-trier.de Fri Sep 25 04:39:32 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Fri, 25 Sep 2009 13:39:32 +0200
Subject: [gutvol-d] Re: Once every 7 years a post of monumental stupidity comes along ...
In-Reply-To: <4ABB5136.5020001@perathoner.de>
References: <1895795374.134329.1253709442237.JavaMail.mail@webmail07> <4ABB5136.5020001@perathoner.de>
Message-ID: <6E86B2B6-36AB-424A-A286-26B2D6FA0803@uni-trier.de>

Hi Marcello,

Just one more afterthought. I assume you have a database. If so, just add one or more extra fields for sorting and fill them from the other fields. This is a one-time step, and then you have the sorting that you need; just sort by those fields. Of course I am assuming that your database is structured.

regards
Keith.
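[For illustration: the extra-sort-field idea from the message above, sketched with Python's sqlite3 module -- compute a folded key once, store it in its own column, and let a plain ORDER BY (with an ordinary index) do the rest. The schema, the sample rows, and the choice of sqlite are assumptions made for the example; they are not PG's actual database or server setup.]

    # Sketch: a precomputed sort_key column next to the display name,
    # so ordinary SQL sorting needs no custom collation at query time.
    import sqlite3

    def sort_key(name):
        # one-time folding of umlauts etc.; the table is illustrative
        return name.translate(str.maketrans(
            {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})).lower()

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE authors (name TEXT, sort_key TEXT)")
    con.execute("CREATE INDEX idx_sort ON authors (sort_key)")
    for name in ["Böll, Heinrich", "Borrow, George", "Bötticher, Georg"]:
        con.execute("INSERT INTO authors VALUES (?, ?)", (name, sort_key(name)))

    # The transformation is paid once, at insert time; queries stay
    # plain SQL and can use the index on sort_key.
    for (name,) in con.execute("SELECT name FROM authors ORDER BY sort_key"):
        print(name)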
From jimad at msn.com Fri Sep 25 10:36:27 2009
From: jimad at msn.com (James Adcock)
Date: Fri, 25 Sep 2009 10:36:27 -0700
Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more)
In-Reply-To:
References:
Message-ID:

>but because it doesn't look the way you thought it _would_ look, you don't recognize that it's exactly what you were looking for...

I reject it not only because it's ugly, doesn't have any decent tools to support it, isn't supported or advocated by anyone world-wide except an army of one, and will not be used by the other volunteers in any case, but more importantly because I find, on a daily basis, cases of things I need to encode as a transcriber where I say "well, obviously there would be no good way to address *this* issue using Bowerbird's scheme."

And then, having established that one has to transcribe into an ugly format -- which I certainly think html, xml, and TEI are also -- one comes rapidly to the conclusion that there is no way that an input transcription format and an output rendered file format *ought* to be one and the same thing, because to do so needlessly subjects the end reader to unnecessary ugliness. Not to mention that PG is rendering to 80 different output file formats in any case, so why *insist* that there be only one input transcription format "holy grail" in the first place?

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jimad at msn.com Fri Sep 25 11:31:21 2009
From: jimad at msn.com (James Adcock)
Date: Fri, 25 Sep 2009 11:31:21 -0700
Subject: [gutvol-d] Re: let's wind this down
In-Reply-To:
References:
Message-ID:

>what? the d.p. process requires the introduction of errors?

Yes, in the encoding of m-dash, ellipses, etc.

>no, you most certainly failed to follow those instructions...
> http://www.gutenberg.org/wiki/Gutenberg:Volunteers'_FAQ#V.94._What_should_I_do_with_italics.3F

First of all, again, if this is important to PG then why do they not properly index it to the PG site's search engine?

Secondly, you refuse to read the immediately preceding section FAQ#V.93 which makes it clear that different volunteers have different priorities about what "plain text" means and how they will be willing to support it and will be using different automatic conversion tools and that some of the volunteers (read: me) will be paying no weight to the desire of other volunteers to make tools to do "automatic prettyprinting" from the "plain text" whereas other volunteers (yourself) are willing to insert "ugliness" into the plain text (their words not mine) in order to better support prettyprinters such as you are proposing.

Finally, you and others at PG are forgetting to heed the closing words given there: Getting a text on-line is the important thing; which choices you [meaning me] make in doing so is a matter of detail.

The choices *I* make as a volunteer are to put my time and effort into doing ONE markup as well as I can namely HTML, and as little time and effort as possible on TXT files -- because for all the arguments raised here I think TXT is a loser and a no-win situation for the volunteer transcriber -- no matter HOW one makes the unhappy tradeoffs *required* by TXT someone will end up unhappy and start "beefing" at you. And the reason that PG is not willing to provide an automatic tool to reduce HTML to TXT is because they know that then THEY not the volunteer transcribers will be the unhappy recipients of these kinds of diatribes.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Bowerbird at aol.com Fri Sep 25 14:31:51 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 25 Sep 2009 17:31:51 EDT
Subject: [gutvol-d] Re: let's wind this down
Message-ID:

jim said:
> Yes, in the encoding of m-dash, ellipses, etc.

i was being sarcastic, jim. just like you, i have railed about the way that d.p. (mis)handles things like em-dashes and ellipses.

> First of all, again, if this is important to PG
> then why do they not properly index it
> to the PG site's search engine?

i dunno. you'll have to ask p.g., not me. but i agree their instructions suck.

i get the impression that they don't even want individual digitizers to do books any more. they want to channel the labor over to d.p., which would be fine, except the d.p. workflow wastes _so_ much volunteer time and energy.

> Secondly, you refuse to read the immediately
> preceding section FAQ#V.93 which makes it clear that
> different volunteers have different priorities about
> what "plain text" means and how they will be willing
> to support it and will be using different automatic
> conversion tools and that some of the volunteers
> (read: me) will be paying no weight to the desire of
> other volunteers to make tools to do "automatic
> prettyprinting" from the "plain text" whereas other
> volunteers (yourself) are willing to insert "ugliness"
> into the plain text (their words not mine) in order to
> better support prettyprinters such as you are proposing.

you can try and do all the doubletalk that you want, jim, but the fact remains that there is a policy, and it is clear:

> Underscores are now the effective standard for italics in PG texts.

your .txt version failed to meet the standard, jim. face it.
> Finally, you and others at PG are forgetting to heed
> the closing words given there:
> > Getting a text on-line is the important thing;
> > which choices you [meaning me] make in doing so
> > is a matter of detail.

do you think this gives you the ok to ignore the italics? not only did your .txt version fail to meet the standard, but now you're telling us you don't have to meet that?

how about we get a ruling from the p.g. people on this? are your digitizers free to ignore the italics if they like? are your digitizers free to ignore any rule they dislike?

> The choices *I* make as a volunteer are to put my
> time and effort into doing ONE markup as well as I can
> namely HTML, and as little time and effort as possible
> on TXT files

and because you're putting so little time and effort into your .txt files, they are coming out as inferior.

again, how about a ruling from the p.g. powers-that-be? are digitizers free to make the .txt files as bad as they choose?

i will keep asking this question until it is answered, so don't think that you can ignore it and it will just go away.

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Bowerbird at aol.com Fri Sep 25 14:41:36 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 25 Sep 2009 17:41:36 EDT
Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more)
Message-ID:

jim said:
> I reject it not only because it's ugly,

it's not ugly. besides, it's a _file-format_, which means it's not even intended that you would look at it directly, any more than you are intended to look at an .html file directly, with its obtrusive angle-brackets. you're mixed up.

> doesn't have any decent tools to support it

i have a whole slew of tools here, and am building more. further, the format is so simple, authors can build tools too.

> isn't supported or advocated by anyone world-wide
> except an army of one,

i'm not an army. i'm just one.

> and will not be used by the other volunteers in any case

won't be used by the d.p. people, that's for sure, not if they know it's from me, because they are so stubborn they don't know what's good for them... which tickles my funny-bone on a constant basis...

> but more importantly because I find, on a daily basis, cases
> of things I need to encode as a transcriber where I say
> "well, obviously there would be no good way to address
> *this* issue using Bowerbird's scheme."

well, that doesn't surprise me one bit, jim, because you don't know jack-shit about my little "scheme". but i am being quite sincere when i tell you that i would _love_ to hear about these so-called "cases." you should be told that i have put out many calls for such "cases", and nobody has ever been able to meet the challenge. so step up, jim, and be the first.

> And then, having established that one has to transcribe
> into an ugly format -- which I certainly think html, xml,
> and TEI are also -- one comes rapidly to the conclusion that
> there is no way that an input transcription format and
> an output rendered file format *ought* to be one and
> the same thing, because to do so needlessly subjects
> the end reader to unnecessary ugliness.

jim, you keep talking about "input" and "output", and you're just confusing yourself with that terminology...

> Not to mention that PG is rendering to 80 different
> output file formats in any case, so why *insist* that there
> be only one input transcription format "holy grail"
> in the first place?
the benefit of a "master" format is that you only have to store and maintain that one format. so it's cost-effective. but thanks for playing, we'll have a consolation gift for you. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajhaines at shaw.ca Fri Sep 25 15:35:31 2009 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Fri, 25 Sep 2009 15:35:31 -0700 Subject: [gutvol-d] Re: let's wind this down References: Message-ID: <71D2BEDB5309484AB6E9D5AD0316C65A@alp2400> Wearing my Whitewasher hat (I've got a producer's hat around here somewhere, too ), I'll answer some of the issues raised below, but I won't argue about them. I leave that to others. Q1: are your digitizers free to ignore the italics if they like? A1: They can, but they'll be referred to the PG Volunteers' FAQ and How-to article(s), and asked to make the necessary corrections. Q2: are your digitizers free to ignore any rule they dislike? A2: Some rules/principles can be ignored in specific cases, e.g. line lengths can exceed 75 characters for highly structured material such as tables, poetry, and such-like. There may be other occasions for bending the rules, but A1 above should be kept in mind. Q3: are digitizers free to make the .txt files as bad as they choose? A3: See A1 It is to be hoped that submitters realize that a plain text file can and should carry almost as much information as any other format. Obviously, they can't carry illustrations or typeface info, but they can certainly carry all the words (in the vast majority of books, the only things that really count) and most, if not all, of the standard and near-standard emphasis indicators (e.g. underscores for italics). ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Friday, September 25, 2009 2:31 PM Subject: [gutvol-d] Re: let's wind this down jim said: > Yes, in the encoding of m-dash, ellipses, etc. i was being sarcastic, jim. just like you, i have railed about the way that d.p. (mis)handles things like em-dashes and ellipses. > First of all, again, if this is important to PG > then why do they not properly index it > to the PG site?s search engine? i dunno. you'll have to ask p.g., not me. but i agree their instructions suck. i get the impression that they don't even want individual digitizers to do books any more. they want to channel the labor over to d.p., which would be fine, except the d.p. workflow wastes _so_ much volunteer time and energy. > Secondly, you refuse to read the immediately > preceding section FAQ#V.93 which makes it clear that > different volunteers have different priorities about > what ?plain text? means and how they will be willing > to support it and will be using different automatic > conversion tools and that some of the volunteers > (read: me) will be paying no weight to the desire of > other volunteers to make tools to do ?automatic > prettyprinting? from the ?plain text? whereas other > volunteers (yourself) are willing to insert ?ugliness? > into the plain text (their words not mine) in order to > better support prettyprinters such as you are proposing. you can try and do all the doubletalk that you want, jim, but the fact remains that there is a policy, and it is clear: > Underscores are now the effective standard for italics in PG texts. your .txt version failed to meet the standard, jim. face it. 
_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d

From jimad at msn.com Fri Sep 25 16:20:07 2009
From: jimad at msn.com (Jim Adcock)
Date: Fri, 25 Sep 2009 16:20:07 -0700
Subject: [gutvol-d] Re: why the plain-text format is the most useful format for eliciting beauty (and more)
In-Reply-To:
References:
Message-ID:

>i have a whole slew of tools here, and am building more. further, the format is so simple, authors can build tools too.

Once you have the tools done, I will try them, in spite of the fact that what I find over and over again is that tools touted by people on DP and PG 1) fail to even install correctly, 2) and when I try them they really don't do anything useful to help me make books. If the tools prove to be useful, then I will happily put up with an ugly file coding format.

From Bowerbird at aol.com Mon Sep 28 12:06:47 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 28 Sep 2009 15:06:47 EDT
Subject: [gutvol-d] 2 weeks in, princeton students hate kindle
Message-ID:

princeton students have been using the kindle...

> after two weeks of use in three classes,
> the Daily Princetonian reports many are
> "dissatisfied and uncomfortable" with their
> e-readers, with one student calling it
> "a poor excuse of an academic tool."
> Most of the criticisms center around
> the Kindle's weak annotation features,
> which make things like highlighting and
> margin notes almost impossible to use,
> but even a simple thing like the lack of
> true page numbers has caused problems,
> since allowing students to cite the Kindle's
> location numbers in their papers is
> "meaningless for anyone working from
> analog books."

oops...

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hart at pobox.com Mon Sep 28 12:41:41 2009
From: hart at pobox.com (Michael S. Hart)
Date: Mon, 28 Sep 2009 12:41:41 -0700 (PDT)
Subject: [gutvol-d] Re: 2 weeks in, princeton students hate kindle
In-Reply-To:
References:
Message-ID:

I tried out the new Sonys yesterday, can't say I liked them at all.
mh

From Bowerbird at aol.com Mon Sep 28 12:49:56 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 28 Sep 2009 15:49:56 EDT
Subject: [gutvol-d] Re: 2 weeks in, princeton students hate kindle
Message-ID:

michael said:
> I tried out the new Sonys yesterday, can't say I liked them at all.

the sony is no better than the kindle, for school use. but for your own use, michael, why didn't you like it? (i'm not surprised by that, but just would like specifics.)

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jimad at msn.com Tue Sep 29 11:57:29 2009
From: jimad at msn.com (Jim Adcock)
Date: Tue, 29 Sep 2009 11:57:29 -0700
Subject: [gutvol-d] Re: 2 weeks in, princeton students hate kindle
In-Reply-To:
References:
Message-ID:

Not sure which of the Kindles the Princeton students got, but if they got the Kindle DX, then they could be reading documents in PDF mode, in which case the pages look the same as paper pages, including page numbers. Re marking up pages, yes, the Kindle allows one to make annotations, but the ability is pretty weak. Page number complaints re the Kindle could just as well be page number complaints for PG, since we seldom keep the page numbers.

Just heard an NPR report on students complaining about buying paper books for college -- $150 new, $100 used, and $75 if rented for one semester.

From gbnewby at pglaf.org Tue Sep 29 16:58:32 2009
From: gbnewby at pglaf.org (Greg Newby)
Date: Tue, 29 Sep 2009 16:58:32 -0700
Subject: [gutvol-d] yolink add-on
Message-ID: <20090929235832.GA17153@pglaf.org>

Did I already forward this information? Sorry if so. This is a search add-on. The info is provided by the producers. I've tried it, and it worked well:

Current search tools don't help you find and analyze information quickly. yolink is a unique and powerful free browser add-on which takes search to the next level:

* cut down search time by as much as 90% by reducing clicks -- analyze search results quickly -- content is delivered to you with keywords highlighted

* enhance your search -- yolink shines at searching through links and electronic documents for multiple terms and relationships

* understand information in context -- unlike typical search results that present a couple of lines of information, yolink displays your keywords in its full context so you can identify and associate related information

* unlock the written word -- use yolink's ability to search for and display multiple keywords in context to quickly analyze books for themes, relationships, and quotes

The Web is all about hyperlinks and digital content. Today's search engines return lists of links that you need to click through to find information. Below is a link to a short video using yolink on a Gutenberg.org electronic book

http://www.yolink.com/yolink/media/gutenburglarge.jsp

yolink also provides a hosted archiving and collaboration platform with its Save & Share feature. yolink Save & Share allows you to quickly organize and share information, including accessing the information from mobile devices such as the iPhone.

Download yolink today at www.yolink.com

From schultzk at uni-trier.de Wed Sep 30 01:26:26 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 30 Sep 2009 10:26:26 +0200
Subject: [gutvol-d] Re: 2 weeks in, princeton students hate kindle(slightly OT)
In-Reply-To:
References:
Message-ID:

Hi All,

As an academic I can understand the problem with citing free-form texts without any true text pages.
But there are ways to get around this. One can use chapters. Also, one might add information on how one formatted the text, to make things easier. Another method could also be to use line numbers.

Similar problems arise when citing texts on the web. Texts on the web give rise to another problem: sometimes texts are, for one reason or another, no longer on the web! All this makes it difficult for academics to cite correctly. For students it can even degrade their papers, as they cannot truly cite in a correct manner and thereby get a poorer rating. It also makes it harder for someone to research a hypothesis. Also, a researcher may not have a Kindle (for example) to check the citation and its ramifications. One can always use other sources, but that carries other problems with it.

I personally would not consider the Kindle an academic tool, but a reading aid and research tool. I would say the same for a computer. Yes, a computer can be an academic research tool if used properly. The Kindle was not developed as a research tool. Just because students are allowed to use them does not make them an academic tool. The idea was to reduce costs for the students.

For truly academic work one would use other sources than e-book readers. The only real reason to use them in the academic field for research is if they are the only source for the information. This goes just the same for texts produced by PG. These texts can only be a starting point, not the end.

regards
Keith.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From answerwitch at gmail.com Wed Sep 30 06:59:11 2009
From: answerwitch at gmail.com (Mjit RaindancerStahl)
Date: Wed, 30 Sep 2009 09:59:11 -0400
Subject: [gutvol-d] Re: 2 weeks in, princeton students hate kindle(slightly OT)
In-Reply-To:
References:
Message-ID: <2f9d57a0909300659i6f1ef650h819f722da695c0@mail.gmail.com>

Per APA citation guidelines: if you can't cite the page number, you cite the paragraph number. They are right that a format-specific reference point is useless outside the format.

-- 
Mjit RaindancerStahl
answerwitch at gmail.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From gbnewby at pglaf.org Wed Sep 30 14:42:35 2009
From: gbnewby at pglaf.org (Greg Newby)
Date: Wed, 30 Sep 2009 14:42:35 -0700
Subject: [gutvol-d] Text to speech service
Message-ID: <20090930214235.GB30174@pglaf.org>

FYI.
This is not free software, but seems interesting for folks looking to do online text to speech conversion:

From: Joe messanella
To: gbnewby at pglaf.org
Subject: Re: text to speech solution blurb

The web is essential for most of us; unfortunately, 20% of the US population has various reading issues! Two years ago I experienced a racing bike accident -- yes, over the handlebars and onto my head. For 6 weeks my balance was compromised, my pronunciation was inconsistent, my world -- gray! I had limited tolerance for "computers"; 6 months later the effects could still be felt. Still, as unfortunate as that may seem, I may now be a better person for it. At least my sense of purpose has been renewed.

I will soon be working for ( www.voice-corp.com ), a small company that offers inexpensive "service based" text to speech solutions: web visitors just click on the listen icon and a player reads the web text or downloads it in mp3 format. I did not mention my accident during my interview, however! Would it matter?

joe.messanella at voice-corp.com