From schultzk at uni-trier.de Mon Feb 1 02:02:35 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 1 Feb 2010 11:02:35 +0100
Subject: [gutvol-d] Re: More iPad details
In-Reply-To:
References:
Message-ID:

Hi Jim,

You have made my point. The point remains that text-to-speech is an important component, but it does not by itself constitute a design for the blind or ...

As you mentioned, the blind will mostly get more hardware and software better suited to their needs.

BTW, Macs have had text-to-speech for decades, too.

regards
Keith.

On 29.01.2010 at 21:24, Jim Adcock wrote:

>> I find your argument moot. As most computers
>> are not designed for the blind or sight-impaired.
>> Sure they can be modified for use with the
>> blind.
>
> I don't understand your comments. Modern computers have many
> "accessibility" features built-in. HTML has "accessibility" features
> built-in. Granted, a blind user will probably want to buy a 3rd-party screen
> reader app to best make use of the accessibility features built into
> computers -- but then again the sighted iPad user will have to download a
> separate Apple app just to be able to read books! Windows 7 comes with a
> basic screen reader. For an overview of these issues see for example:
>
> http://www.microsoft.com/enable/
>
> Blind users have been using text-to-speech with computers since DECtalk in
> 1984. A notable user you have probably seen and heard on TV is Stephen
> Hawking.

From schultzk at uni-trier.de Mon Feb 1 02:06:15 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 1 Feb 2010 11:06:15 +0100
Subject: [gutvol-d] Re: More iPad details
In-Reply-To:
References: <627d59b81001290940q58464655n51e14df6fbd06939@mail.gmail.com>
Message-ID: <2C8597D5-3FC6-4EC0-97C5-AB337BD96815@uni-trier.de>

On 29.01.2010 at 21:59, Jim Adcock wrote:

>> I have a sight-impaired friend who would appreciate having one of those
>> Kindles drop-kicked in his direction. He figures he can deal with the
>> buttons somehow.
>
> Here is a reference to the National Federation of the Blind lawsuit over
> Kindle use on college campuses, which was concluded by ending the Kindle
> campus program in progress, and Kindle agreeing to improve accessibility.
> The lawsuit alleged that the Kindles were inaccessible to blind students and
> thus violate federal law.
>
> http://www.nfb.org/nfb/NewsBot.asp?MODE=VIEW&ID=527
>
> So hopefully Kindles will someday soon be able to speak the buttons and the
> list of book titles and authors. Can't find any place where Amazon talks
> about this issue -- not surprisingly! Hopefully Apple and the iPad have enough
> experience that they will not step into the same puddle!

Apple has the technology for text-to-speech. They should be able to port it. They managed iWork; I am sure they can manage text-to-speech. The question remains whether the iPad will still perform well. We have to wait and see.

regards
Keith.

From schultzk at uni-trier.de Mon Feb 1 02:13:07 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 1 Feb 2010 11:13:07 +0100
Subject: [gutvol-d] Re: Transferring PG files from PC to iPod
In-Reply-To:
References:
Message-ID: <66B1B3B6-ADE9-46E5-AFFF-43DB5AE8C328@uni-trier.de>

Hi BB,

I felt you were unnecessarily hard with some of your typical comments. My comments were meant to hit the nerve they hit. I did not mean to get on your back -- just a demonstration of being on the other side of certain comments.

As far as the problem of formatting goes, we do live on the same planet, though often enough in different cultures.

Take care.
regards
Keith.

On 29.01.2010 at 18:33, Bowerbird at aol.com wrote:

> keith said:
> > Sorry, BB, I think you did not do Walter and Andrew justice.
> > They did not attack anyone and just stated their views.
> > You could have just mentioned the advantages of
> > eucalyptus. But why be so sarcastic here.
>
> hey, back off, keith, now.
>
> i didn't "attack" anyone, not by any stretch of the imagination.
>
> i just disagreed with something walter said. or, more specifically,
> i asked for clarification, and registered a few counter-thoughts...
>
> and i don't appreciate it when people mistake my motives and
> then mischaracterize them as if they had some handle on them.
>
> you've made a mistake here, keith, a bad mistake, and if i were
> the whining kind, i'd probably demand some kind of apology,
> but as it is, i'm just warning you to stop making that mistake...
>
> > You could have just mentioned the advantages of
> > eucalyptus. But why be so sarcastic here.
>
> i _did_ mention the "advantage" of eucalyptus, nice formatting.
>
> but that just introduces the same question i asked about stanza,
> namely, "what is it that _constitutes_ nice and proper formatting?"
>
> this is a good question, one that really _needs_ to be asked, so
> that we can then go on and ask more sophisticated questions,
> such as "how do we apply that formatting?", and "what kind of
> rule-set is eucalyptus following in order to apply its formatting?",
> and so on. as it is, though, as evidenced by the mess of formats
> coming out of d.p., there is a wide range of "formatting" that
> _could_ be considered "proper", so it's rather meaningless when
> someone refers to "proper formatting", and it's good to know that.
> it doesn't mean they are "wrong", but it _does_ mean that we are
> justified in asking them precisely what _they_ mean by the term...
>
> and further, there is no "sarcasm" here. i'm plenty capable of being
> sarcastic; it's something i do often, and fairly well, although there
> probably isn't much "honor" in that performance in most eyes, but
> there's no reason to think that everything that i do is "sarcastic"...
> if you pay any attention at all, it should be quite easy to see when
> i am being sarcastic and when i'm not. so keith, _pay_attention_,
> at least if you're going to make commentary.
>
> also, i'm not sure if eucalyptus uses the utf8 version of files or not.
> plain-text doesn't rule out an encoding -- or even utf8 -- you know.
>
> -bowerbird

From jimad at msn.com Mon Feb 1 11:02:26 2010
From: jimad at msn.com (James Adcock)
Date: Mon, 1 Feb 2010 11:02:26 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B65299A.7060304@perathoner.de>
References: <20100124200541.GH27785@pglaf.org> <4B60A4D4.8050701@perathoner.de> <4B613AB1.4020302@perathoner.de> <4B61D55B.8010405@perathoner.de> <4B65299A.7060304@perathoner.de>
Message-ID:

>> So why do the PG-generated mobis not have a TOC?
>
> Better ask mobipocket. We use their official 'mobigen' conversion tool for linux.

Mobipocket is Amazon. The latest version of mobigen is called kindlegen, at:

http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621

Not that that seems to help any. Supposedly it has better support now for NCX, but that didn't cause it to create a TOC for me.
What did "work" for me was either of the following two approaches:

1) Use Calibre ebook-convert instead of mobigen. Apply it to either your epub or your opf and it generates a book including a TOC. You'd need to check, of course, that it doesn't introduce other problems for you.

Or:

2) I CAN generate a TOC using kindlegen and using your opf (extracted from your epub files) when I perform the following changes:

a) in your opf explicitly add a toc.htm file ... ...

and where the toc.htm then contains basically the same information you are already generating for the toc.ncx, except in HTML format -- which raises the question of what "support" for NCX actually means. But in any case, taking this approach (which you can see is also the approach taken in the worked "Sample" book example distributed with kindlegen) creates a MOBI file with TOC support as users would expect.

From jimad at msn.com Mon Feb 1 11:06:28 2010
From: jimad at msn.com (James Adcock)
Date: Mon, 1 Feb 2010 11:06:28 -0800
Subject: [gutvol-d] Re: More iPad details
In-Reply-To:
References:
Message-ID:

> You have made my point.

Well, I am happy to have made your point -- but I still have no idea what your point is.

From jimad at msn.com Mon Feb 1 11:28:49 2010
From: jimad at msn.com (James Adcock)
Date: Mon, 1 Feb 2010 11:28:49 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To:
References: <20100124200541.GH27785@pglaf.org> <4B60A4D4.8050701@perathoner.de> <4B613AB1.4020302@perathoner.de> <4B61D55B.8010405@perathoner.de> <4B65299A.7060304@perathoner.de>
Message-ID:

Sorry, also just got an Amazon email pointing me to this doc:

http://s3.amazonaws.com/kindlegen/AmazonKindlePublishingGuidelinesV1.3.pdf

where on page 11 it says:

  TOC guideline #1: the Logical TOC (NCX) is mandatory
  The Logical Table Of Contents is very important for our mutual customer's
  reading experience, as it allows them to easily navigate between chapters
  on Kindle 2. So all Kindle books should have both logical and HTML TOCs.
  Users expect to see an HTML TOC when paging through a book from the
  beginning, while the logical table of contents is an additional way for
  users to navigate books.

So indeed they want both the toc.ncx and the toc.htm -- still haven't figured out what they think they are doing with the toc.ncx!
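[Archive note: the OPF snippets in Adcock's message above were lost when the list software scrubbed the HTML attachment. As a hedged sketch only -- the file names and ids here are hypothetical, not Adcock's originals -- the kind of change he describes is a manifest entry for the HTML contents page plus a guide reference in the .opf package file:

    <manifest>
      <item id="toc" href="toc.htm" media-type="application/xhtml+xml"/>
      <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
      <!-- ... the book's content files ... -->
    </manifest>
    <spine toc="ncx">
      <!-- ... reading order ... -->
    </spine>
    <guide>
      <reference type="toc" title="Table of Contents" href="toc.htm"/>
    </guide>

On Adcock's account, the reference with type="toc" in the guide is what lets the converted MOBI wire a target to the device's dedicated TOC button.]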
From marcello at perathoner.de Mon Feb 1 11:32:30 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Mon, 01 Feb 2010 20:32:30 +0100
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To:
References: <20100124200541.GH27785@pglaf.org> <4B60A4D4.8050701@perathoner.de> <4B613AB1.4020302@perathoner.de> <4B61D55B.8010405@perathoner.de> <4B65299A.7060304@perathoner.de>
Message-ID: <4B672C4E.8030600@perathoner.de>

James Adcock wrote:

>>> So why do the PG-generated mobis not have a TOC?
>>
>> Better ask mobipocket. We use their official 'mobigen' conversion tool for linux.
>
> Mobipocket is Amazon. The latest version of mobigen is called kindlegen, at:
>
> http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621

kindlegen tells me that it builds a TOC. Now if Amazon would release their "Kindle for PC" for Linux too, I could actually check the generated files... knowing that the Kindle runs on Linux, where's the big holdup?

$ ./kindlegen pg31142.epub
***********************************************
* Amazon.com kindlegen(Linux) V1.0 build 85 *
* A command line e-book compiler *
* Copyright Amazon.com 2009 *
***********************************************

opt version: try to minimize (default)
Info(prcgen): Added metadata dc:Title "On the Nature of Thought / or, The act of thinking and its connexion with a perspicuous sentence"
Info(prcgen): Added metadata dc:Date "2010-01-31"
Info(prcgen): Added metadata dc:Creator "John Haslam"
Info(prcgen): Added metadata dc:Rights "Public domain in the USA."
Info(prcgen): Added metadata dc:Source "http://www.gutenberg.org/files/31142/31142-h/31142-h.htm"
Info(prcgen): Parsing files 0000001
Info(prcgen): Resolving hyperlinks
Info(prcgen): Building table of content URL: /tmp/fileY6tCul/31142/toc.ncx
Info(prcgen): Computing UNICODE ranges used in the book
Info(prcgen): Found UNICODE range: Basic Latin [20..7E]
Info(prcgen): Found UNICODE range: General Punctuation - Windows 1252 [2013..2014]
Info(prcgen): Found UNICODE range: Latin-1 Supplement [A0..FF]
Info(prcgen): Building MOBI file, record count: 0000023
Info(prcgen): Final stats - text compressed to (in % of original size): 054.13%
Info(prcgen): The document identifier is: "On_the_Natur-cuous_sentence"
Info(prcgen): The file format version is V6
Info(prcgen): Saving MOBI file
Info(prcgen): MOBI File successfully generated!
$

--
Marcello Perathoner
webmaster at gutenberg.org
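[The toc.ncx that kindlegen reports building above is the "logical TOC" the thread keeps arguing about. For readers unfamiliar with the format, a minimal sketch -- the entry titles and anchors here are hypothetical, not taken from the actual PG file -- looks like:

    <?xml version="1.0" encoding="utf-8"?>
    <ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
      <head>
        <meta name="dtb:uid" content="http://www.gutenberg.org/ebooks/31142"/>
      </head>
      <docTitle><text>On the Nature of Thought</text></docTitle>
      <navMap>
        <navPoint id="np-1" playOrder="1">
          <navLabel><text>Chapter I</text></navLabel>
          <content src="31142-h.htm#chap01"/>
        </navPoint>
        <!-- one navPoint per chapter -->
      </navMap>
    </ncx>

The dispute below is whether a conforming MOBI also needs the same map repeated as an HTML page (toc.htm) before a device's TOC button will work.]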
From marcello at perathoner.de Mon Feb 1 11:56:39 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Mon, 01 Feb 2010 20:56:39 +0100
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To:
References: <20100124200541.GH27785@pglaf.org> <4B60A4D4.8050701@perathoner.de> <4B613AB1.4020302@perathoner.de> <4B61D55B.8010405@perathoner.de> <4B65299A.7060304@perathoner.de>
Message-ID: <4B6731F7.8050503@perathoner.de>

James Adcock wrote:

> 2. So all Kindle books should have both logical and HTML TOCs. Users expect
> to see an HTML TOC when paging through a book from the beginning, while the
> logical table of contents is an additional way for users to navigate books.

As most PG ebooks already contain a TOC inside the HTML, it's pointless to generate another one.

--
Marcello Perathoner
webmaster at gutenberg.org

From jimad at msn.com Mon Feb 1 12:23:02 2010
From: jimad at msn.com (James Adcock)
Date: Mon, 1 Feb 2010 12:23:02 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B6731F7.8050503@perathoner.de>
References: <20100124200541.GH27785@pglaf.org> <4B60A4D4.8050701@perathoner.de> <4B613AB1.4020302@perathoner.de> <4B61D55B.8010405@perathoner.de> <4B65299A.7060304@perathoner.de> <4B6731F7.8050503@perathoner.de>
Message-ID:

> As most PG ebooks already contain a TOC inside the HTML, it's pointless
> to generate another one.

Sigh, you are going around in circles. The issue is that you are generating MOBI files that do not correctly implement the TOC standard of MOBI files. The result of this is that when a user of a PG file clicks on the dedicated "TOC" button on their e-reader device, the MOBI file you generate fails to take them to the TOC. This is a file format failure on the part of the file format YOU are generating.

Yes, the HTML files from the creator of the PG book often also contain a "TOC" in HTML format. IF, for example, you were to generate a toc.htm pointing to the "TOC" already in one of the book's HTML files and correctly link that toc.htm into your opf file, THEN when the PG user clicks on the dedicated "TOC" button in their ebook reader, that TOC button WOULD function correctly and take them to the TOC the creator has already generated in one of their HTML files. Or alternatively, if they have already created a file called toc.htm, you could just link to that correctly, as required, in the opf file, and everything would work. Or, if you are generating a TOC in NCX format, you could with trivial changes also generate a toc.htm, which you could correctly link into the opf file, and then the TOC button would also work. Or you could use the Calibre ebook-convert software, which would do this automatically for you, and again everything would actually work.

But instead you continue to pimp the resulting MOBI file format because YOU think YOU should be the one to choose which devices PG users should be reading on, rather than generating valid files in the file formats that PG customers need to read on the devices they already own.

I think this is silly. Let the marketplace decide. If Amazon acts in an onerous way to customers, then customers will choose to buy from Apple and read in EPUB format. If Apple acts in an onerous way to customers, then customers will choose to buy from Amazon and will read in MOBI format. Having the choice helps drive the e-book vendors into less onerous behavior -- hopefully! So far all that Apple has succeeded in doing is driving up the price of new releases for all ebook readers from $9.99 to $15.99 -- thanks Jobs, that's quite an accomplishment!
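[The Calibre route Adcock recommends, as a command line. A sketch assuming a stock calibre installation; the file names are hypothetical and the available options vary by calibre version:

    $ ebook-convert pg31142.epub pg31142.mobi

ebook-convert infers both formats from the file extensions, and per Adcock's report above, the resulting MOBI includes a working TOC without any hand-editing of the OPF.]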
From marcello at perathoner.de Mon Feb 1 13:05:35 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Mon, 01 Feb 2010 22:05:35 +0100
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To:
References: <20100124200541.GH27785@pglaf.org> <4B60A4D4.8050701@perathoner.de> <4B613AB1.4020302@perathoner.de> <4B61D55B.8010405@perathoner.de> <4B65299A.7060304@perathoner.de> <4B6731F7.8050503@perathoner.de>
Message-ID: <4B67421F.7090500@perathoner.de>

James Adcock wrote:

>> As most PG ebooks already contain a TOC inside the HTML, it's pointless
>> to generate another one.
>
> Sigh, you are going around in circles. The issue is that you are generating
> MOBI files that do not correctly implement the TOC standard of MOBI files.

mobigen is generating MOBI files that ...

> The result of this is that when a user of a PG file clicks on the dedicated
> "TOC" button on their e-reader device, the MOBI file you generate fails to
> take them to the TOC. This is a file format failure on the part of the file
> format YOU are generating.

Not at all. The epub files I generate validate with epubcheck and the TOC displays correctly on ADE readers.

mobigen then, for whatever reason of its own, fumbles a perfectly valid toc.ncx in a perfectly valid epub file. This is Amazon's problem. I suggest they download a copy of the epub spec and give it to their developers.

> Or you could use the Calibre ebook-convert software, which would do this
> automatically for you, and again everything would actually work.

Calibre is slow and converts everything first to an interim format (Sony LRF, I think) which loses most formatting. But foremost, calibre is a kitchen sink that has dozens of dependencies, some of which I cannot install on ibiblio. E.g. it wants cherrypy v2 whereas I use cherrypy v3 for gutenberg development. (What calibre needs a web application server for is beyond me.)

> But instead you continue to pimp the resulting MOBI file format because YOU
> think YOU should be the one to choose which devices PG users should be
> reading on, rather than generating valid files in the file formats that PG
> customers need to read on the devices they already own.

I use the official kindlegen v1.0 (as of today) that Amazon says publishers should use to generate files for the Kindle. Save your breath to complain to Amazon, because it's their application that is broken and not my epub files. If my files don't pass epubcheck, I will fix them. If Amazon needs some non-standard gimmick inserted because they can't be bothered to implement the spec, then I will definitely NOT insert it.

> I think this is silly. Let the marketplace decide. If Amazon acts in an
> onerous way to customers, then customers will choose to buy from Apple and
> read in EPUB format. If Apple acts in an onerous way to customers, then
> customers will choose to buy from Amazon and will read in MOBI format.

Let me know when 'the marketplace' has fixed the bugs in their app.

--
Marcello Perathoner
webmaster at gutenberg.org
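[The validation step Marcello refers to, as a command line -- a sketch; epubcheck is a Java tool and the jar name depends on the version installed:

    $ java -jar epubcheck.jar pg31142.epub

A clean file produces no error output; a malformed toc.ncx or a broken manifest entry would be reported here, which is why passing epubcheck is his criterion for saying the epub side of the pipeline is valid.]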
From jimad at msn.com Mon Feb 1 13:26:36 2010
From: jimad at msn.com (Jim Adcock)
Date: Mon, 1 Feb 2010 13:26:36 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B67421F.7090500@perathoner.de>
References: <20100124200541.GH27785@pglaf.org> <4B60A4D4.8050701@perathoner.de> <4B613AB1.4020302@perathoner.de> <4B61D55B.8010405@perathoner.de> <4B65299A.7060304@perathoner.de> <4B6731F7.8050503@perathoner.de> <4B67421F.7090500@perathoner.de>
Message-ID:

> Not at all. The epub files I generate validate with epubcheck and the TOC
> displays correctly on ADE readers.

You are leaving out the TOC element in the guide structure of the epub files. While it is legal to do so in EPUB [not MOBI], the fact that ADE displays a "TOC" [actually the NCX structure] even when you leave it out of the guide can be considered "an extension" at best, a non-conforming behavior of ADE at worst. NCX is NOT a "TOC" per se; see:

http://www.openebook.org/2007/opf/OPF_2.0_final_spec.html#Section2.4.1

particularly 2.4.1.1, where in comparison it shows how the TOC, List of Illustrations, etc. are SUPPOSED to be implemented, at:

http://www.openebook.org/2007/opf/OPF_2.0_final_spec.html#Section2.6

From Bowerbird at aol.com Mon Feb 1 16:14:58 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 1 Feb 2010 19:14:58 EST
Subject: [gutvol-d] roundlessness -- 001
Message-ID: <1c73a.392d55ff.3898c882@aol.com>

welcome to february 2010... :+)

roger frank (known as rfrank over at distributed proofreaders) is doing an experimental test of a roundless proofing system:

> http://www.fadedpage.com

i've argued for years and years that d.p. should go roundless, so i see this experiment as a wonderful thing, and i support it. i repeat, this is a very very very very very very very good thing...

so worthwhile, in fact, that i will spend some time analyzing it, and offering up the valuable gift of some constructive criticism. i'm sure roger will be thrilled to hear it...

in order to get the most out of my posts, you should probably go over and register at the site and do a little bit of work there. that way you'll get enough experience to have a feel for the site. you might wanna read the forums too, so as to grasp the issues. it won't take much time, and it'll acquaint you with the future.

i'll probably have 28 days' worth of material -- so settle in and make yourselves comfortable as we look at roger's experiment during the month of february...

-bowerbird

From pterandon at gmail.com Mon Feb 1 18:38:14 2010
From: pterandon at gmail.com (Greg M. Johnson)
Date: Mon, 1 Feb 2010 21:38:14 -0500
Subject: [gutvol-d] Psychology of interacting with (Google's) ebooks.
Message-ID:

I downloaded two epubs from Google Books and one or both of the book reading apps on my Android phone didn't even see one of them.

I think that some of these collections are designed with the idea that the repository should be on the web, and you the reader should go search the web interface to find a book you want, then download that one book, have perfect confidence it's going to be cool to read and functioning properly, then maybe you'll go on to the next one a few days later.

I don't think humans work that way. First of all, web interfaces, especially on a phone, are inherently slow, and sometimes unavailable, either due to wifi/3G coverage or due to embarrassment about using "work bandwidth".
The Google Books interface isn't *bad*, but it's still like being fed at a gourmet banquet with a baby spoon. The user may have one bad experience with a downloaded text, no matter how small, and they want to curate their own collection first, maybe hoard up more books than they or their family could read in a lifetime, cull out the icky or malfunctioning texts, and then have, say, 20 on their reader and 2000 on a DVD in a safe in their basement. At least that's how I respond to having one or two minor problems. ;)

I don't think that Google Books at least gets this. I spent so much time at Google Books, browsing in apparently spider-like fashion, that I got this warning:

"We're sorry...

... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now."

I guess they're right. At any moment I was about to try to download a few hundred epubs.

--
Greg M. Johnson
http://pterandon.blogspot.com

From schultzk at uni-trier.de Mon Feb 1 23:09:29 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Tue, 2 Feb 2010 08:09:29 +0100
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To:
References:
Message-ID: <3766AD4E-D2C0-41DB-BAC4-943A4DB3F2FE@uni-trier.de>

Hi Greg,

On 02.02.2010 at 03:38, Greg M. Johnson wrote:

> I downloaded two epubs from Google Books and one or both of the book reading apps on my Android phone didn't even see one of them.

You may have put the books on your phone, BUT does your phone/reader know they are there?!!! On my Nokias I load music with their tool from my Mac, but I have to have the player scan the phone for music to see it. Maybe you have to do that. Or maybe the reader on your Android needs some other files to see the books!

Hope this helps.

regards
Keith.

From marcello at perathoner.de Mon Feb 1 23:14:59 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 02 Feb 2010 08:14:59 +0100
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To:
References:
Message-ID: <4B67D0F3.1040907@perathoner.de>

Greg M. Johnson wrote:

> I don't think that Google Books at least gets this. I spent so much time at
> Google Books, browsing in apparently spider-like fashion, that I got this warning:
>
> "We're sorry...
>
> ... but your computer or network may be sending automated queries. To protect
> our users, we can't process your request right now."

That may not be a question of getting 'it' but of getting 'hit'.

gutenberg.org too gets hit by dozens of spiders a day, some of them sitting on big pipes and working with up to a hundred threads. While one of those spiders is at work, a human user can just about forget getting anything out of gutenberg.org, because all server cycles are used to serve the spider.

This is why gutenberg.org automatically denies access to IPs that make more than a certain number of requests per hour.

I think with Google the problem may be even worse than with gutenberg.org.

--
Marcello Perathoner
webmaster at gutenberg.org
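[A sketch of the kind of per-IP throttle Marcello describes; the window and budget numbers here are hypothetical, not gutenberg.org's actual configuration:

    import time
    from collections import defaultdict

    WINDOW = 3600         # one hour, in seconds
    MAX_REQUESTS = 400    # hypothetical per-IP hourly budget

    hits = defaultdict(list)   # ip -> timestamps of requests in the current window

    def allow(ip):
        """Return True if this IP is still under its hourly request budget."""
        now = time.time()
        hits[ip] = [t for t in hits[ip] if now - t < WINDOW]  # drop expired hits
        if len(hits[ip]) >= MAX_REQUESTS:
            return False          # over budget: deny the request
        hits[ip].append(now)
        return True

A spider running a hundred threads burns through such a budget in seconds, which is why heavy human browsing can trip the same trap -- the server cannot tell the two apart by rate alone.]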
From pterandon at gmail.com Tue Feb 2 05:17:35 2010
From: pterandon at gmail.com (Greg M. Johnson)
Date: Tue, 2 Feb 2010 08:17:35 -0500
Subject: [gutvol-d] Re: Formats and gripes
Message-ID:

From: "James Adcock"
To: "'Project Gutenberg Volunteer Discussion'"
Date: Mon, 1 Feb 2010 12:23:02 -0800
Subject: [gutvol-d] Re: Formats and gripes

>> As most PG ebooks already contain a TOC inside the HTML,
>> it's pointless to generate another one.
>
> Sigh, you are going around in circles. The issue is that you
> are generating MOBI files that do not correctly implement
> the TOC standard of MOBI files.

TOC is one thing. PG's epub file for "At the Earth's Core" (pg123.epub) shows up under a list of "Unknown Authors" on my Android phone's FBReader (software recommended by PG). There's no title for it either in the display in one's Library. Once you open it, it appears to work well, even with a TOC! But is there something different about the way this text was prepared in comparison to, say, the way the epub for "The Three Musketeers" was prepared? That one shows up correctly with title and author.

--
Greg M. Johnson
http://pterandon.blogspot.com
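[An e-reader's library view takes the title and author from the dc: metadata in the epub's OPF, so a book filed under "Unknown Authors" with no title usually means those elements are missing or malformed rather than anything being wrong in the book body. A sketch of the relevant block -- the values are illustrative:

    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
              xmlns:opf="http://www.idpf.org/2007/opf">
      <dc:title>At the Earth's Core</dc:title>
      <dc:creator opf:file-as="Burroughs, Edgar Rice">Edgar Rice Burroughs</dc:creator>
      <dc:language>en</dc:language>
    </metadata>

Whether pg123.epub actually lacked these entries, or FBReader simply failed to parse them, is not settled anywhere in this thread.]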
From marcello at perathoner.de Tue Feb 2 06:48:56 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 02 Feb 2010 15:48:56 +0100
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To:
References: <20100124200541.GH27785@pglaf.org> <4B60A4D4.8050701@perathoner.de> <4B613AB1.4020302@perathoner.de> <4B61D55B.8010405@perathoner.de> <4B65299A.7060304@perathoner.de> <4B6731F7.8050503@perathoner.de> <4B67421F.7090500@perathoner.de>
Message-ID: <4B683B58.90800@perathoner.de>

Jim Adcock wrote:

>> Not at all. The epub files I generate validate with epubcheck and the TOC
>> displays correctly on ADE readers.
>
> You are leaving out the TOC element in the guide structure of the epub
> files. While it is legal to do so in EPUB [not MOBI], the fact that ADE
> displays a "TOC" [actually the NCX structure] even when you leave it out
> of the guide can be considered "an extension" at best, a non-conforming
> behavior of ADE at worst.

From the epub spec:

> Within the package there may be one guide element.
> Reading Systems are not required to use the guide element in any way.

The guide is optional on both sides, the publishing side and the consumer side. If Amazon makes it a requirement to have a guide in the epub, they clearly didn't understand the spec.

--
Marcello Perathoner
webmaster at gutenberg.org

From prosfilaes at gmail.com Tue Feb 2 10:33:05 2010
From: prosfilaes at gmail.com (David Starner)
Date: Tue, 2 Feb 2010 13:33:05 -0500
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B683B58.90800@perathoner.de>
References: <4B65299A.7060304@perathoner.de> <4B6731F7.8050503@perathoner.de> <4B67421F.7090500@perathoner.de> <4B683B58.90800@perathoner.de>
Message-ID: <6d99d1fd1002021033r50294c57j1dd8dd8180b8fd53@mail.gmail.com>

On Tue, Feb 2, 2010 at 9:48 AM, Marcello Perathoner wrote:
> The guide is optional on both sides, the publishing side and the consumer
> side. If Amazon makes it a requirement to have a guide in the epub they
> clearly didn't understand the spec.

Clearly. You've been around for a while; you know that in practice there are optional features that are mandatory if you want decent support for the user.

--
Kie ekzistas vivo, ekzistas espero.

From jimad at msn.com Tue Feb 2 11:50:33 2010
From: jimad at msn.com (James Adcock)
Date: Tue, 2 Feb 2010 11:50:33 -0800
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To: <4B67D0F3.1040907@perathoner.de>
References: <4B67D0F3.1040907@perathoner.de>
Message-ID:

In my experience and opinion, Google Books is designed to be overly paranoid about the spidering issue. I can spend 15 minutes there searching for interesting books without even downloading hardly any of them, and then Google goes into paranoid mode and starts requiring "Captcha" on everything I do. Also, the search algorithm, whatever it is, is bizarre. One day I can find a particular book; I come back the next day and enter the same search terms, and suddenly Google Books can't find it any more. Having said that, I find I can usually live with a Google Book that I find and am interested in -- either in the PDF format or the EPUB, it depends -- assuming I can't find a PG version of the book where a real human being has fixed the scannos! Someday maybe I'll even learn to live with the occasional thumb that shows up in my books! Certainly it is cool, the ancient and obscure things one can find on Google Books. Not clear their efforts are really, overall, to the long-term benefit of society, however. And there is a general problem that the more residual benefits citizens find in old books, the more likely our "representatives" will take away our constitutional rights to read and share old books, and "sell" those rights back to ebook retailers like Google -- as has already happened in the millennium copyright laws, and/or DRM.

From hart at pglaf.org Tue Feb 2 12:34:11 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Tue, 2 Feb 2010 12:34:11 -0800 (PST)
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To:
References: <4B67D0F3.1040907@perathoner.de>
Message-ID:

Well said!!!

mh

On Tue, 2 Feb 2010, James Adcock wrote:

> In my experience and opinion, Google Books is designed to be overly paranoid
> about the spidering issue. [...]

From hart at pglaf.org Tue Feb 2 12:35:45 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Tue, 2 Feb 2010 12:35:45 -0800 (PST)
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To:
References:
Message-ID:

Well said!!!

I should have posted this earlier. . . and mentioned that I asked permission to use this, forward it, etc., in the future. . . .

Michael

On Mon, 1 Feb 2010, Greg M. Johnson wrote:

> I downloaded two epubs from Google Books and one or both of the book reading
> apps on my Android phone didn't even see one of them. [...]

From Bowerbird at aol.com Tue Feb 2 15:59:55 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 2 Feb 2010 18:59:55 EST
Subject: [gutvol-d] roundlessness -- 002
Message-ID: <56f4.5ddd8802.389a167b@aol.com>

we're looking at rfrank's "roundless" experiment at fadedpage.com...

as i said yesterday, this test is a very very very very very good thing, because distributed proofreaders has been bogged down in a morass of "rounds" for many years now. their standard workflow now calls for _three_ rounds of proofing, followed by _two_ rounds of formatting... throw in a "preprocessing" round, and their "postprocessing", which is followed by "postprocessing verification", and you've got 8 rounds.

i don't know about you, but to me, that seems like a lot...

but that's not the worst of it. the worst is the resultant backlogs...

the problem arises because d.p. has thousands of proofers doing p1 (the first round of proofing), but d.p. only has hundreds that do p2 (the second round), and mere _dozens_ doing p3 ("final" proofing)...

needless to say, the large number of proofers doing p1 can proof more than the smaller number doing p2, or the tiny number in p3. the backlog created is (understandably) frustrating and demoralizing for the proofers trying to keep up in p2, and is killing the p3 proofers.

there is also the gnawing feeling that not all pages _need_ 3 rounds. indeed, _most_ pages in _most_ books are simple enough that they can be finished in one round, two at the most. so the _inefficiency_ of the 3-round proofing is rather striking as well.
the thought is that each page should be proofed only as many times as that page needs; this has been labeled as a "roundless" system.

aside from the backlogs of partially-done material, the other sign of a problem with the d.p. workflow is that production has flattened... even though d.p. enjoys a constant stream of incoming volunteers, thanks to all of the good-will that project gutenberg's free e-books have generated over the years, d.p. output has leveled out at under 250 books per month, which works out to less than 3,000 per year. against the backdrop of the _millions_ of books google has scanned, this is a mere drop in the bucket. a small drop in a very large bucket.

rfrank doesn't go into all of this on his site. perhaps he didn't need to, since the d.p. people he's recruited are well-acquainted with the issues.

but rfrank is also unclear on many of the details of his little experiment, which is a more worrying matter. specifically, i don't see a lot of experimental rigor here. it seems to me that roger is unfamiliar with the mechanics of the scientific method and its applicability to human social experiments. i see no evidence of any stated hypotheses, nor any way such hypotheses can be disconfirmed...

the reason people developed the scientific method was because we found that when we just fooled around "to see how things turn out", we often ended up fooling ourselves about what we had seen, and what it meant. we learned that we had to actually specify our hypotheses, and devise tests (experiments) specifically designed to disconfirm our hypotheses. otherwise, our brains are only too willing to accommodate what we find as being "supportive" of our initial impressions. ("experimenter bias" is the term by which this insidious phenomenon is most well-known.)

if i'm correct, this problem will surface in rfrank's future results, and surface repeatedly, so there's no need for me to labor the point now. but i wanted to frame this particular issue, here and now, in advance.

that's enough for today. see you tomorrow...

-bowerbird

From jimad at msn.com Tue Feb 2 16:02:49 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 2 Feb 2010 16:02:49 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B683B58.90800@perathoner.de>
References: <20100124200541.GH27785@pglaf.org> <4B60A4D4.8050701@perathoner.de> <4B613AB1.4020302@perathoner.de> <4B61D55B.8010405@perathoner.de> <4B65299A.7060304@perathoner.de> <4B6731F7.8050503@perathoner.de> <4B67421F.7090500@perathoner.de> <4B683B58.90800@perathoner.de>
Message-ID:

> The guide is optional on both sides, the publishing side and the consumer
> side. If Amazon makes it a requirement to have a guide in the epub they
> clearly didn't understand the spec.

Amazon doesn't make it a requirement to have a guide in epub; they make it a requirement to have a guide in mobi. Both epub and mobi can be made from OPF; they just have slightly different requirements on that OPF file set. You could easily generate the set of files required for epub, generate that file format, then add the one extra file required for a conforming mobi -- which is just a slightly different syntax than the ncx file -- add one link statement in the opf, and recompile the set of files for a fully conforming mobi. But instead you blame Amazon for the fact that YOU are choosing to make files that will not work correctly on the majority of e-book readers being sold in the market.
You could easily make them work if you wanted to, but you don't want them to work. Other web sites for books, including sites for free books using basically the same set of tools that you are using, instead of making excuses and finger-pointing, ARE making files that work correctly on the majority of e-book readers being sold in the market. It's not like this is a whole lot of work for you one way or another. It's just that you WANT to pimp the files you are making for Kindles.

From jimad at msn.com Tue Feb 2 16:13:02 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 2 Feb 2010 16:13:02 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B672C4E.8030600@perathoner.de>
References: <20100124200541.GH27785@pglaf.org> <4B60A4D4.8050701@perathoner.de> <4B613AB1.4020302@perathoner.de> <4B61D55B.8010405@perathoner.de> <4B65299A.7060304@perathoner.de> <4B672C4E.8030600@perathoner.de>
Message-ID:

> Now if Amazon would release their "Kindle for PC" for Linux too, I could
> actually check the generated files... knowing that the Kindle runs on Linux,
> where's the big holdup?

Don't know, other than presumably not enough people in the world run Linux to make it worth their while. You CAN however use Linux to install Mobipocket's free mobile-device-compatible Reader -- Mobipocket being part of Amazon -- said reader supports about 50 different popular mobile devices. The Mobipocket Reader will also allow you to confirm the fact that you are not adding a conforming TOC to your mobi files. Read:

http://www.mobipocket.com/en/DownloadSoft/ProductDetailsReader.asp

and look for the little penguin on the right-hand side of the page.

From jimad at msn.com Tue Feb 2 17:33:01 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 2 Feb 2010 17:33:01 -0800
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <56f4.5ddd8802.389a167b@aol.com>
References: <56f4.5ddd8802.389a167b@aol.com>
Message-ID:

> ...killing the p3 proofers.

The problem is worse: under the pressure to produce, and having become "jaded", the p3'ers apparently do not bother to even look at the digitized images of the author's text, but rather assume that they know best and introduce changes which are other than what the author wrote. There is also the problem of "false positives" -- once the errors left in the text become infrequent enough, the human mind wants to make changes to "show you're making a positive contribution" even when there was no error there that the p3'ers ought to be fixing.

But even the p3 problem is nothing compared to the wait time in post-processing, where things can get hung up for literally about another year. If PG were able to easily accept a txt file now and the html version (and other versions later), not only would readers get some books a year earlier, but we could probably save some efforts that die and get lost somewhere between txt complete and html complete. Why does posting have to happen "all at once" ???

From gbnewby at pglaf.org Tue Feb 2 17:44:12 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Tue, 2 Feb 2010 17:44:12 -0800
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To:
References: <56f4.5ddd8802.389a167b@aol.com>
Message-ID: <20100203014412.GA26584@pglaf.org>

On Tue, Feb 02, 2010 at 05:33:01PM -0800, Jim Adcock wrote:
> ...
> If PG were able to easily accept a txt file now and the html version (and
> other versions later), not only would readers get some books a year earlier,
> but we could probably save some efforts that die and get lost somewhere
> between txt complete and html complete.
> Why does posting have to happen "all at once" ???

It doesn't. In fact, "extracting" works from DP earlier was a big push I made a couple of years ago. At that time, such two-stage (or other greater-than-one-stage) output was something that didn't fit well with the workflow. Maybe that's something that could be revisited.

It's important to not double the effort involved at the final posting phase (whitewashing) through such a two-stage process. But there are several good ways of ensuring this, which could be incorporated with the process.

There is definitely flexibility.
-- Greg

From dakretz at gmail.com Tue Feb 2 18:00:48 2010
From: dakretz at gmail.com (don kretz)
Date: Tue, 2 Feb 2010 18:00:48 -0800
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <20100203014412.GA26584@pglaf.org>
References: <56f4.5ddd8802.389a167b@aol.com> <20100203014412.GA26584@pglaf.org>
Message-ID: <627d59b81002021800x472c11e3n634eedd90a840bb6@mail.gmail.com>

That's real good news, Greg, especially if you're talking about flexibility on the DP side. 100% of the responsibility for evaluating and recommending changes to the DP process has apparently been relegated to the DP Board of Directors.

Since you are one of the five directors, you're in the know if anyone is. Since you represent 20% of the horsepower responsible for coming up with those changes, I trust you've been busy.

From gbnewby at pglaf.org Tue Feb 2 18:20:55 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Tue, 2 Feb 2010 18:20:55 -0800
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <627d59b81002021800x472c11e3n634eedd90a840bb6@mail.gmail.com>
References: <56f4.5ddd8802.389a167b@aol.com> <20100203014412.GA26584@pglaf.org> <627d59b81002021800x472c11e3n634eedd90a840bb6@mail.gmail.com>
Message-ID: <20100203022055.GA28054@pglaf.org>

On Tue, Feb 02, 2010 at 06:00:48PM -0800, don kretz wrote:
> That's real good news, Greg, especially if you're talking about flexibility
> on the DP side. 100% of the responsibility for evaluating and recommending
> changes to the DP process has apparently been relegated to the DP Board
> of Directors.

I don't think that was the intention of the (relatively) new Board and new GM. The Board has ideas, but isn't trying to manage day-to-day activity.
> Since you are one of the five directors, you're in the know if anyone is.
> Since you represent 20% of the horsepower responsible for coming up with
> those changes, I trust you've been busy.

Indeed, but actually we have not been looking at this level of detail for changes in the DP processing chain. The Board isn't to micromanage, and isn't to get in the way of progress.

That said, if you think there are proposals, ideas for change, etc. that are not getting the attention they deserve, I would be happy to bring them to the board (or GM, as appropriate) on anyone's behalf, anonymously if desired.
-- Greg

From ke at gnu.franken.de Tue Feb 2 21:28:32 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Wed, 03 Feb 2010 06:28:32 +0100
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To: (James Adcock's message of "Tue, 2 Feb 2010 11:50:33 -0800")
References: <4B67D0F3.1040907@perathoner.de>
Message-ID:

"James Adcock" writes:

> One day I can find a particular book, I come back the next day and
> enter the same search terms, and suddenly Google Books can't find it
> any more.

So what? If the environment changes (more books, new reviews, external linking, etc.), yesterday's assumptions could be different or even "wrong" today.

Sidenote: It is the same with the idea of the iso-8859-1 (or ASCII for languages that require more characters) version of books. These days everything should be UTF-8 encoded by default. The ASCII idea was fine some twenty years ago, but today it is time for change.

On gutenberg you cannot find most books at all! They simply do not exist in our cosmos. And what's worse, even the important books are mostly missing or weakly done.

I'm happy that google offers all these books. If one issue has defects, chances are high that there is another copy in the Google cache that you can use as a remedy.
--
Karl Eichwalder

From dakretz at gmail.com Tue Feb 2 21:43:07 2010
From: dakretz at gmail.com (don kretz)
Date: Tue, 2 Feb 2010 21:43:07 -0800
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <20100203022055.GA28054@pglaf.org>
References: <56f4.5ddd8802.389a167b@aol.com> <20100203014412.GA26584@pglaf.org> <627d59b81002021800x472c11e3n634eedd90a840bb6@mail.gmail.com> <20100203022055.GA28054@pglaf.org>
Message-ID: <627d59b81002022143k3582d0fam473fcd4a01523749@mail.gmail.com>

And on the other end we're hearing the same thing - the GM is there only to manage, and initiative for change will come from the Board.

I'm absolutely not suggesting the Board is or should be micro- or macro-managing. I think everyone is expecting that the Board is about planning. You're not? You disagree?
From ke at gnu.franken.de Tue Feb 2 23:01:40 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Wed, 03 Feb 2010 08:01:40 +0100
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <20100203014412.GA26584@pglaf.org> (Greg Newby's message of "Tue, 2 Feb 2010 17:44:12 -0800")
References: <56f4.5ddd8802.389a167b@aol.com> <20100203014412.GA26584@pglaf.org>
Message-ID:

Greg Newby writes:

> On Tue, Feb 02, 2010 at 05:33:01PM -0800, Jim Adcock wrote:
> It doesn't. In fact, "extracting" works from DP earlier was a big push
> I made a couple of years ago. At that time, such two-stage (or other
> greater-than-one-stage) output was something that didn't fit well with
> the workflow. Maybe that's something that could be revisited.

I'm all for it. In the DP forum, I proposed this several times.

> It's important to not double the effort involved at the final posting
> phase (whitewashing) through such a two-stage process. But there are
> several good ways of ensuring this, which could be incorporated with
> the process.

Could we give this a try with manually selected books first? How can we make sure that we do not waste the whitewashers' time?

--
Karl Eichwalder

From traverso at posso.dm.unipi.it Tue Feb 2 23:18:27 2010
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Wed, 3 Feb 2010 08:18:27 +0100 (CET)
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: (message from Karl Eichwalder on Wed, 03 Feb 2010 08:01:40 +0100)
References: <56f4.5ddd8802.389a167b@aol.com> <20100203014412.GA26584@pglaf.org>
Message-ID: <20100203071827.50C00FFB2@cardano.dm.unipi.it>

While we are at it, could we consider a revision of the requirements for the PG txt files? Allowing a bit more flexibility (for example, allowing the original line and page breaks to be preserved), possibly with the availability of the page images, would improve considerably the maintenance of the files and the addition of new versions.

Carlo

From marcello at perathoner.de Tue Feb 2 23:21:14 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Wed, 03 Feb 2010 08:21:14 +0100
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To:
References: <20100124200541.GH27785@pglaf.org> <4B60A4D4.8050701@perathoner.de> <4B613AB1.4020302@perathoner.de> <4B61D55B.8010405@perathoner.de> <4B65299A.7060304@perathoner.de> <4B6731F7.8050503@perathoner.de> <4B67421F.7090500@perathoner.de> <4B683B58.90800@perathoner.de>
Message-ID: <4B6923EA.6050700@perathoner.de>

Jim Adcock wrote:

> But instead you blame Amazon for the fact that YOU are choosing to make
> files that will not work correctly on the majority of e-book readers being
> sold in the market. You could easily make them work if you wanted to, but
> you don't want them to work.

If I didn't want them to work I wouldn't generate them in the first place.

To recap for the last time:

1. I do generate epub files that pass epubcheck and display correctly on ADE mobile readers.

2. Amazon provides a converter, "kindlegen", that claims to convert epub files into their proprietary mobi format.

3. kindlegen fumbles the perfectly valid toc that is inside my epubs and generates a mobi file without a toc (your claim).

4. You tell me that I should volunteer more unpaid time to work around a bug in Amazon's converter, reverse-engineer their closed proprietary format for which they provide no documentation, and maybe test it on a dozen devices that I should buy out of my own pocket.
IMHO you should bugger the people that chose to make the format proprietary, to not document it in any way and on top of that release buggy converter software. Remember another textbook example: Internet Explorer, even in its 8th incarnation, still does not follow w3c standards. And why is that possible? Because developers all over the world chose to work around Microsoft's bugs instead of forcing them to fix their software. I'm not going down that slippery slope: If I did, I'd spend more time working around other people's bugs than writing new functionality. But YOU are perfectly free to volunteer your time to save Amazon some bucks: Take my epubs, patch them, and convert them to mobis that display the toc when you hit the toc button, and redistribute them on your site. -- Marcello Perathoner webmaster at gutenberg.org From frank.vandrogen at bc.biol.ethz.ch Tue Feb 2 23:28:13 2010 From: frank.vandrogen at bc.biol.ethz.ch (van Drogen Frank) Date: Wed, 3 Feb 2010 08:28:13 +0100 Subject: [gutvol-d] Re: roundlessness -- 002 In-Reply-To: <56f4.5ddd8802.389a167b@aol.com> References: <56f4.5ddd8802.389a167b@aol.com> Message-ID: <5B4C3A336FC71D4495CB3318A111D285042A23A7@EX2.d.ethz.ch> > i see no evidence of any > stated hypotheses, nor any way such hypotheses can be disconfirmed... > > the reason people developed the scientific method was because we found > that when we just fooled around "to see how things turn out", we often > ended up fooling ourselves about what we had seen, and what it meant. Not quite familiar with modern advances in sciences, I reckon. Now-a-days it seems we're supposed to look at systems as a whole, instead of doing hypothesis-driven experiments (at least, granting agencies seem to think so). Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Feb 3 02:24:11 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 3 Feb 2010 05:24:11 EST Subject: [gutvol-d] Re: roundlessness -- 002 Message-ID: get yer own thread! -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Wed Feb 3 08:53:08 2010 From: dakretz at gmail.com (don kretz) Date: Wed, 3 Feb 2010 08:53:08 -0800 Subject: [gutvol-d] Re: Formats and gripes In-Reply-To: <4B6923EA.6050700@perathoner.de> References: <4B6731F7.8050503@perathoner.de> <4B67421F.7090500@perathoner.de> <4B683B58.90800@perathoner.de> <4B6923EA.6050700@perathoner.de> Message-ID: <627d59b81002030853k15255e3bkeb3e43f02b91bed9@mail.gmail.com> > > > I'm not going down that slippery slope: If I did, I'd spend more time > working around other people's bugs than writing new functionality. > > But YOU are perfectly free to volunteer your time to save Amazon some > bucks: Take my epubs, patch them, and convert them to mobis that display the > toc when you hit the toc button, and redistribute them on your site. > > -- > Marcello Perathoner > webmaster at gutenberg.org I had a car like this once. The turn signal was on the right side of the steering column. The headlight dimmer was on the left side. The window winders worked backwards. The inside door locks would lock when you pulled them up, and unlock when you pushed them down. An iconoclastic car, which was one reason I liked it. No concessions. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From Bowerbird at aol.com Wed Feb 3 10:05:01 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 3 Feb 2010 13:05:01 EST Subject: [gutvol-d] Re: roundlessness -- 002 Message-ID: frank said: > Now-a-days it seems we're supposed to look at systems > as a whole, instead of doing hypothesis-driven experiments > (at least, granting agencies seem to think so). i believe that as this series continues, my point will become crystal-clear. in the absence of such clarity, or if you think your point continues to have some merit, frank, please do make it in a more specific manner later on... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Feb 3 10:17:08 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 3 Feb 2010 13:17:08 EST Subject: [gutvol-d] Re: Formats and gripes Message-ID: dakretz said: > I had a car like this once. The turn signal > was on the right side of the steering column. > The headlight dimmer was on the left side. > The window winders worked backwards. > The inside door locks would lock when you pulled them up, > and unlock when you pushed them down. > An iconoclastic car, which was one reason I liked it. funny how much we are willing to deviate from "the standard" when we are making the decision to do so for our own reasons, and how unwilling we are to do so when someone else asks us... marcello would jump through all kinds of hoops to make his own preferred formats work, but he won't do jack shit for anyone else. if you're not gonna make a mobi version that runs on the kindle, there isn't much sense in making a mobi version at all, is there? but like all technocrats, marcello is great at displacing the blame: "if it doesn't work for you, it must be your fault. not my problem." and even if another "project gutenberg volunteer" were to _solve_ this particular problem, i doubt marcello would mount the solution. i don't know who gave him this power to decide what gets blessed and what doesn't, but i wish they would now take it away from him. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From joey at joeysmith.com Wed Feb 3 12:11:04 2010 From: joey at joeysmith.com (Joey Smith) Date: Wed, 3 Feb 2010 13:11:04 -0700 Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks. In-Reply-To: References: <4B67D0F3.1040907@perathoner.de> Message-ID: <20100203201104.GA957@joeysmith.com> On Wed, Feb 03, 2010 at 06:28:32AM +0100, Karl Eichwalder wrote: [snip] > On gutenberg you cannot find most books at all! They simply do not > exist in our cosmos. And what's worse, even the important books are > mostly missing or weakly done. > > I'm happy that google offers all these books. If one issue has defects, > chances are high that there is another copy in the Google cache that > you can use as a remedy. Do you have a list of these "important books" that PG is missing but which are available in Google Books? From dakretz at gmail.com Wed Feb 3 12:23:03 2010 From: dakretz at gmail.com (don kretz) Date: Wed, 3 Feb 2010 12:23:03 -0800 Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks. In-Reply-To: <20100203201104.GA957@joeysmith.com> References: <4B67D0F3.1040907@perathoner.de> <20100203201104.GA957@joeysmith.com> Message-ID: <627d59b81002031223v67963d79u3998df223b8d8856@mail.gmail.com> In fact, DP recently had an active discussion about trying harder to work from lists of "important books" not yet on PG.
This would be very helpful. On Wed, Feb 3, 2010 at 12:11 PM, Joey Smith wrote: > On Wed, Feb 03, 2010 at 06:28:32AM +0100, Karl Eichwalder wrote: > > [snip] > > > On gutenberg you cannot find most books at all! They simply do not > > exist in our cosmos. And what's worse, even the important books are > > mostly missing or weakly done. > > > > I'm happy that google offers all these books. If one issue has defects, > > chances are high that there is another copy in the Google cache that > > you can use as a remedy. > > Do you have a list of these "important books" that PG is missing but which > are available in Google Books? > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Wed Feb 3 13:23:59 2010 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 03 Feb 2010 22:23:59 +0100 Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks. In-Reply-To: <20100203201104.GA957@joeysmith.com> References: <4B67D0F3.1040907@perathoner.de> <20100203201104.GA957@joeysmith.com> Message-ID: <4B69E96F.9020508@perathoner.de> Joey Smith wrote: > On Wed, Feb 03, 2010 at 06:28:32AM +0100, Karl Eichwalder wrote: > > [snip] > >> On gutenberg you cannot find most books at all! They simply do not >> exist in our cosmos. And what's worse, even the important books are >> mostly missing or weakly done. >> >> I'm happy that google offers all these books. If one issue has defects, >> chances are high that there is another copy in the Google cache that >> you can use as a remedy. > > Do you have a list of these "important books" that PG is missing but which > are available in Google Books? Marx's Kapital Freud's Traumdeutung Russell's Principia Mathematica Gray's Anatomy ... just a few off the top of my head. -- Marcello Perathoner webmaster at gutenberg.org From ke at gnu.franken.de Wed Feb 3 22:53:08 2010 From: ke at gnu.franken.de (Karl Eichwalder) Date: Thu, 04 Feb 2010 07:53:08 +0100 Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks. In-Reply-To: <4B69E96F.9020508@perathoner.de> (Marcello Perathoner's message of "Wed, 03 Feb 2010 22:23:59 +0100") References: <4B67D0F3.1040907@perathoner.de> <20100203201104.GA957@joeysmith.com> <4B69E96F.9020508@perathoner.de> Message-ID: Marcello Perathoner writes: >> Do you have a list of these "important books" that PG is missing but which >> are available in Google Books? > > Marx's Kapital > Freud's Traumdeutung > Russell's Principia Mathematica > Gray's Anatomy > > ... just a few off the top of my head. Yes, and not a single text by Novalis, just two books by Fontane, ditto by W. Raabe, three by Stifter, 1 text by Jean Paul. I think there is hardly a single German edition of poems of the Middle Ages (say, Walther von der Vogelweide). And literature about these topics is also rather rare--Google offers tons of those. All the German journals--there is basically nothing available from gutenberg.org. I'm not sure whether there are still broken LOTE editions at gutenberg.org, where you simply replaced umlauts with "matching" letters (ä -> a) for the sake of clean ASCII text... I do not blame us because of these deficiencies, but please treat competitors respectfully, etc. pp.
-- Karl Eichwalder From traverso at posso.dm.unipi.it Thu Feb 4 01:49:17 2010 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Thu, 4 Feb 2010 10:49:17 +0100 (CET) Subject: [gutvol-d] autogenerated HTML Message-ID: <20100204094917.A23DDFFB5@cardano.dm.unipi.it> I have just become aware that PG now autogenerates HTML for texts that don't have it. Unfortunately, however, sometimes the autogenerated file is garbage (e.g. poetry rewrapped, see 31079). Would it be possible to have the autogeneration program find what the problem is, or at least to preview the autogenerated file, and possibly fix either the program or the files? Carlo From Bowerbird at aol.com Thu Feb 4 02:39:09 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 4 Feb 2010 05:39:09 EST Subject: [gutvol-d] Re: autogenerated HTML Message-ID: <1108.22585ed6.389bfdcd@aol.com> carlo said: > I have just become aware that PG now > autogenerates HTML for texts that don't have it. since a number of people are fond of rewriting history here, let's note for the record that i suggested this some time ago. indeed, my recommendation was that the .txt version should be used to autogenerate the .html version for _all_ the books, that hand-crafted .html be abandoned because it is too hard to maintain and to upgrade. i also suggested that conformance to this strategy would enable p.g. to improve the .txt versions... and i predicted that sooner or later, you'd all come around to this workflow. and how you have. so i will say "i told you so." > Unfortunately, however, sometimes the autogenerated file > is garbage (e.g. poetry rewrapped, see 31079). without even looking at those files, i can guess what's wrong... many of the books that are exclusively poetry are set flush to the left margin, lacking any of the leading spaces that serve as a signal to the conversion program not to wrap the lines... so of course the converter is gonna wrap the lines. this is an error, a major error, in the processing of these books. (and it's so easy to change every linebreak to a linebreak+space.) > Would it be possible to have the autogeneration program find > what the problem is, or at least to preview the autogenerated file, > and possibly fix either the program or the files? i've never tried to verify it with a closer analysis, but my impression is that some of the whitewashers use a slightly different converter... and then of course there are a number of different ones over at d.p., including the one in thundercat's app, and another by david garcia... without dedication to making the .txt version correct at the outset, however, it doesn't matter how good the converter might be... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Thu Feb 4 10:15:53 2010 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu, 04 Feb 2010 19:15:53 +0100 Subject: [gutvol-d] Re: autogenerated HTML In-Reply-To: <20100204094917.A23DDFFB5@cardano.dm.unipi.it> References: <20100204094917.A23DDFFB5@cardano.dm.unipi.it> Message-ID: <4B6B0ED9.2080304@perathoner.de> Carlo Traverso wrote: > I have just become aware that PG now autogenerates HTML for texts that > don't have it. Unfortunately, however, sometimes the autogenerated file > is garbage (e.g. poetry rewrapped, see 31079). Would it be possible to > have the autogeneration program find what the problem is, or at > least to preview the autogenerated file, and possibly fix either the > program or the files?
http://www.gutenberg.org/tools/epubmaker-0.02-preview-2009-11-26.tgz Look into parsers/GutenbergTextParser.py -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Fri Feb 5 12:45:33 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 5 Feb 2010 15:45:33 EST Subject: [gutvol-d] Re: autogenerated HTML Message-ID: <126a8.62677071.389ddd6d@aol.com> i said: > without even looking at those files, i can guess what's wrong... > > many of the books that are exclusively poetry are set flush to > the left margin, lacking any of the leading spaces that serve > as a signal to the conversion program not to wrap the lines... > > so of course the converter is gonna wrap the lines. > > this is an error, a major error, in the processing of these books. > > (and it's so easy to change every linebreak to a linebreak+space.) well, gee, i should have looked at those files earlier, because when i did get around to looking at them, i got a good laugh. see, the lines _were_ indented, so shouldn't have been wrapped. and wouldn't have been wrapped if most of the converters around had been used. but i have since learned that there is an "experimental" converter, programmed by marcello, and that's what was used for this book. the irony of this gave me a big hearty laugh. you see, when i laid out the philosophy of z.m.l. here on this list, some of you will remember how much crap marcello threw at me. the guy was relentless. even though he almost never made _any_ posts to the list otherwise, he would respond negatively to anything i said. but never constructively. he'd just throw out pure bullcrap... for a long time i responded, just to clearly specify that it was crap; but after a while i decided to let his crap speak for its crappy self, and i stopped responding. after that, he stopped responding to me. (i guess that thing they say about not feeding the trolls is right on.) but if you wanna see what an ass he was, you can check the archives. you can also go look at a website he set up with a lot of quotes from messages i sent here. most of them are taken out of context, sure, but even then i stand behind them. you'll see that they were correct. (that's right, he set up a _fan_ page, on his web-site, to ridicule me; i don't know where the guy is coming from, but i think he should get a life.) anyway, marcello insisted that my approach was bunk, and that one could not successfully generate a full-on .html file from a .txt version. yet now he is writing code to do just that. (he's not doing it _successfully_ yet, so in regard to _himself_ alone, i guess he was right. but i have been successful for a long time now.) what's especially ironic -- and funny to me -- is that marcello is now suffering through the same complications that i experienced, namely maddeningly inconsistent .txt files, which he must program around, just like i did. further, this will lead him to the same conclusion that i came to, which is that this would be _much_ simpler if only the rules that p.g. has already established for .txt files were simply _followed_... even more irony, and thus even more humor? marcello is programming in _python_, a language where indentation is _meaningful_. this is ironic, and funny, because one of the things about z.m.l. which marcello once tried to lambaste is that whitespace is meaningful. every time he does an indent, i hope he chokes on it. meanwhile, i just have one thing to say to all you z.m.l. naysayers: i told you so. you were wrong. i was right. i hope you choke on it. (wait, is that one thing, or 4 things? oh well, guess it doesn't matter.) -bowerbird
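p.s. since "proper unwrapping" keeps coming up, here's a minimal sketch of the leading-space convention in python -- this is _not_ my unwrap.pl and it's _not_ marcello's parser, just the core idea, so you can see how small it is:

    import textwrap

    def reflow(text, width=72):
        # blank lines separate the blocks in a p.g. plain-text file
        out = []
        for block in text.split("\n\n"):
            lines = block.splitlines()
            if any(line[:1].isspace() for line in lines):
                # leading whitespace is the no-wrap signal --
                # poetry and tables pass through untouched
                out.append(block)
            else:
                # flush-left prose: unwrap the hard linebreaks,
                # then refill to the target width
                joined = " ".join(line.strip() for line in lines)
                out.append(textwrap.fill(joined, width))
        return "\n\n".join(out)

if the poetry carries its leading spaces, it survives. if some preprocessor stripped them, no converter on earth can put them back. that's the whole ballgame.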
-------------- next part -------------- An HTML attachment was scrubbed... URL: From pterandon at gmail.com Fri Feb 5 17:25:11 2010 From: pterandon at gmail.com (Greg M. Johnson) Date: Fri, 5 Feb 2010 20:25:11 -0500 Subject: [gutvol-d] Re: psychology of interacting with ebooks In-Reply-To: References: Message-ID: <> IMNSHO, another fruit of making the TXT-80 format the default standard. The programming will be a headache if one can't just delete *all* the CR's. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Feb 5 18:04:49 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 5 Feb 2010 21:04:49 EST Subject: [gutvol-d] Re: psychology of interacting with ebooks Message-ID: <1b1f0.5792aa2e.389e2841@aol.com> greg said: > The programming will be a headache > if one can't just delete *all* the CR's. nah. programming an unwrap routine is quite easy... > http://z-m-l.com/go/unwrap.pl -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Sat Feb 6 05:35:38 2010 From: prosfilaes at gmail.com (David Starner) Date: Sat, 6 Feb 2010 08:35:38 -0500 Subject: [gutvol-d] Re: psychology of interacting with ebooks In-Reply-To: References: Message-ID: <6d99d1fd1002060535r137a7312y9c22bc357e1ecc09@mail.gmail.com> On Fri, Feb 5, 2010 at 8:25 PM, Greg M. Johnson wrote: > <> > > IMNSHO, another fruit of making the TXT-80 format the default standard. > The programming will be a headache if one can't just delete *all* the CR's. But it's done; we can no more change it now than Compuserve can change the format of GIF files. -- Kie ekzistas vivo, ekzistas espero. From pterandon at gmail.com Sat Feb 6 05:50:06 2010 From: pterandon at gmail.com (Greg M. Johnson) Date: Sat, 6 Feb 2010 08:50:06 -0500 Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks. In-Reply-To: References: Message-ID: Marcello's story about big-pipe servers settling on pg makes me think of a Borg spaceship passing over a peaceful little village. At the risk of sounding like Eleanor Clift's response to the Soviets in "The Watchmen" movie, I'll ask: So what on earth are these big-pipe servers doing? Are they generating their own independent collection in case of a collapse of the internet? Are they engaged in some really inefficient search algorithm that requires opening every single file? Are they some Google wannabe who's indexing your site? Is it malicious mischief / DoS? Or, is it a case of an "honest" (if cluelessly implemented) demand that could be met with some more products that could be torrented. Could that entity be looking for a MOBI of the top 1000 books, and EPUB of everything in the German language? ---------- Forwarded message ---------- > From: Marcello Perathoner > To: Project Gutenberg Volunteer Discussion > Date: Tue, 02 Feb 2010 08:14:59 +0100 > Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks. > Greg M. Johnson wrote: > > I don't think that Google Books at least gets this. I spent so much time >> at Google Books, browsing in apparently spider-like fashion, that I got this >> warning: >> >> >> "We're sorry... >> >> ... but your computer or network may be sending automated queries. To >> protect our users, we can't process your request right now." >> > > That may not be a question of getting `it' but of getting `hit'.
> > gutenberg.org too gets hit by dozens of spiders a day, some of them > sitting on big pipes and working with up to a hundred threads. > > While one of those spiders is at work, a human user can just about forget > getting anything out of gutenberg.org because all server cycles are used > to serve the spider. > > This is why gutenberg.org automatically denies access to IPs that make > more than a certain amount of requests per hour. > > I think with Google the problem may be even worse than with gutenberg.org. > -- > Marcello Perathoner > -- Greg M. Johnson http://pterandon.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Sat Feb 6 10:00:43 2010 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 06 Feb 2010 19:00:43 +0100 Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks. In-Reply-To: References: Message-ID: <4B6DAE4B.7000007@perathoner.de> Greg M. Johnson wrote: > Marcello's story about big-pipe servers settling on pg makes me think of a > Borg spaceship passing over a peaceful little village. > > At the risk of sounding like Eleanor Clift's response to the Soviets in "The > Watchmen" movie, I'll ask: > So what on earth are these big-pipe servers doing? Most of them are collecting innocent-looking phrases to inject into spam mails. -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Sat Feb 6 14:31:47 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 6 Feb 2010 17:31:47 EST Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks. Message-ID: <2dc9e.665ac718.389f47d3@aol.com> greg said: > So what on earth are these big-pipe servers doing? collecting information, so as to assemble it, and reassemble it, thereby producing new information that might change the world. or make money. lots of money. lots and lots and lots of money... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From gbnewby at pglaf.org Sat Feb 6 16:18:50 2010 From: gbnewby at pglaf.org (Greg Newby) Date: Sat, 6 Feb 2010 16:18:50 -0800 Subject: [gutvol-d] Technical measures? (Re: Re: Psychology of interacting with (Google's)) ebooks. In-Reply-To: <4B6DAE4B.7000007@perathoner.de> References: <4B6DAE4B.7000007@perathoner.de> Message-ID: <20100207001850.GD14117@pglaf.org> On Sat, Feb 06, 2010 at 07:00:43PM +0100, Marcello Perathoner wrote: > Greg M. Johnson wrote: > >Marcello's story about big-pipe servers settling on pg makes me > >think of a Borg spaceship passing over a peaceful little village. > > > >At the risk of sounding like Eleanor Clift's response to the > >Soviets in "The Watchmen" movie, I'll ask: > >So what on earth are these big-pipe servers doing? > > Most of them are collecting innocent-looking phrases to inject into > spam mails. Did you ever look into mod_evasive or a similar approach? It's a good way of automatically shutting down abusers. Takes some tuning (a bit like spam filters). This is something iBiblio would be happy to help with, I'm sure. 
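The core mechanism is tiny -- a toy sliding-window sketch (the threshold numbers are made up, and this is neither mod_evasive's actual algorithm nor gutenberg.org's real limiter):

    import time
    from collections import defaultdict, deque

    WINDOW = 3600.0    # seconds
    LIMIT = 400        # requests per IP per window -- a made-up cap

    hits = defaultdict(deque)

    def allow(ip, now=None):
        # True if this request stays under the per-IP cap
        now = time.time() if now is None else now
        q = hits[ip]
        while q and now - q[0] > WINDOW:
            q.popleft()        # forget requests older than the window
        if len(q) >= LIMIT:
            return False       # deny: this IP exceeded the cap
        q.append(now)
        return True

mod_evasive hangs the same idea directly into Apache, per child process, which is where the tuning comes in: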
http://www.zdziarski.com/projects/mod_evasive/ -- Greg From gbnewby at pglaf.org Sun Feb 7 10:41:51 2010 From: gbnewby at pglaf.org (Greg Newby) Date: Sun, 7 Feb 2010 10:41:51 -0800 Subject: [gutvol-d] Re: roundlessness -- 002 In-Reply-To: <627d59b81002022143k3582d0fam473fcd4a01523749@mail.gmail.com> References: <56f4.5ddd8802.389a167b@aol.com> <20100203014412.GA26584@pglaf.org> <627d59b81002021800x472c11e3n634eedd90a840bb6@mail.gmail.com> <20100203022055.GA28054@pglaf.org> <627d59b81002022143k3582d0fam473fcd4a01523749@mail.gmail.com> Message-ID: <20100207184151.GA6083@pglaf.org> On Tue, Feb 02, 2010 at 09:43:07PM -0800, don kretz wrote: > And on the other end we're hearing the same thing - the GM is there only to > manage, > and initiative for change will come from the Board. I'm absolutely not > suggesting the Board > is or should be micro or macro managing. I think everyone is expecting that > the > Board is about Planning. You're not? You disagree? Planning is exactly right. (Sorry for not responding sooner) -- Greg > On Tue, Feb 2, 2010 at 6:20 PM, Greg Newby wrote: > > > On Tue, Feb 02, 2010 at 06:00:48PM -0800, don kretz wrote: > > > That's real good news, Greg, especially if you're talking about > > flexibility > > > on > > > the DP side. 100% of the responsibility for evaluating and recommending > > > changes to the DP process has been apparently relegated to the DP Board > > > of Directors. > > > > I don't think that was the intention of the (relatively) new Board and > > new GM. The Board has ideas, but isn't trying to manage day to day > > activity. > > > > > Since you are one of the five directors, you're in the know if anyone is. > > > Since > > > you represent 20% of the horsepower responsible for coming up with those > > > changes, I trust you've been busy. > > > > Indeed, but actually we have not been looking at this level > > of detail for changes in the DP processing chain. The Board > > isn't to micromanage, and isn't to get in the way of progress. > > > > That said, if you think there are proposals, ideas for change, > > etc. that are not getting the attention they deserve, I would > > be happy to bring them to the board (or GM, as appropriate) on > > anyone's behalf, anonymously if desired. > > > > -- Greg > > > > > On Tue, Feb 2, 2010 at 5:44 PM, Greg Newby wrote: > > > > > > > On Tue, Feb 02, 2010 at 05:33:01PM -0800, Jim Adcock wrote: > > > > > ... > > > > > If PG were able to easily accept a txt file now and the html version > > (and > > > > > other versions later) not only would readers get some books a year > > > > earlier, > > > > > but we could probably save some efforts that die and get lost > > somewhere > > > > > between txt complete and html complete. Why does posting have to > > happen > > > > "all > > > > > at once" ??? > > > > > > > > It doesn't. In fact, "extracting" works from DP earlier was a big push > > > > I made a couple of years ago. At that time, such two stage (or other > > > > greater-than-one stage) output was something that didn't fit well with > > > > the workflow. Maybe that's something that could be revisited. > > > > > > > > It's important to not double the effort involved at the final posting > > > > phase (whitewashing) through such a two stage process. But there are > > > > several good ways of insuring this, which could be incorporated with > > > > the process. > > > > > > > > There is definitely flexibility.
> > > > > > > > -- Greg > > > > _______________________________________________ > > > > gutvol-d mailing list > > > > gutvol-d at lists.pglaf.org > > > > http://lists.pglaf.org/mailman/listinfo/gutvol-d > > > > > > _______________________________________________ > > gutvol-d mailing list > > gutvol-d at lists.pglaf.org > > http://lists.pglaf.org/mailman/listinfo/gutvol-d > > From gbnewby at pglaf.org Sun Feb 7 10:46:25 2010 From: gbnewby at pglaf.org (Greg Newby) Date: Sun, 7 Feb 2010 10:46:25 -0800 Subject: [gutvol-d] Re: roundlessness -- 002 In-Reply-To: References: <56f4.5ddd8802.389a167b@aol.com> <20100203014412.GA26584@pglaf.org> Message-ID: <20100207184625.GB6083@pglaf.org> On Wed, Feb 03, 2010 at 08:01:40AM +0100, Karl Eichwalder wrote: > Greg Newby writes: > > > On Tue, Feb 02, 2010 at 05:33:01PM -0800, Jim Adcock wrote: > > > It doesn't. In fact, "extracting" works from DP earlier was a big push > > I made a couple of years ago. At that time, such two stage (or other > > greater-than-one stage) output was something that didn't fit well with > > the workflow. Maybe that's something that could be revisited. > > I'm all for it. In the DP forum, I proposed this several times. > > > It's important to not double the effort involved at the final posting > > phase (whitewashing) through such a two stage process. But there are > > several good ways of insuring this, which could be incorporated with > > the process. > > Could we give this a try with manually selected books first? How can we > make sure that we do not waste the whitewashers' time? Definitely. On a trial basis, the extra (or different) workload isn't such a big concern...we don't need to streamline while we're trying to experiment. From the ww'er side, all you really need is a note with the upload that mentions "HTML will be forthcoming later," and then reference the .txt eBook # when the HTML is finally uploaded. From the DP side, it seems that all this takes is an early extraction of formatted, proofread text, prior to going to HTML. I'm sure it's somewhat more complicated than that, due to various cascading effects and perhaps some hard-coded policy on workflow, but I hope we all could accommodate some minor upheaval in the interest of exploration. -- Greg From Bowerbird at aol.com Sun Feb 7 10:50:29 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 7 Feb 2010 13:50:29 EST Subject: [gutvol-d] rfrank reports in Message-ID: <8005.73d837a3.38a06575@aol.com> ok, rfrank has made a report over in the d.p. forums on the latest set of results from his "roundless" experiment. so let's see what he says, and what i think in reaction... *** rfrank said: > So far in testing the roundless system as it stands, > I've left it to the proofer to say when they thought a page > was done. Turns out, that is reliable for only a very few proofers. > Those who wish to say "I told you so" can chime in now, rightfully. ok. > the post-processing clearable errors were > caused mostly by four proofers, and > each of those made several different kinds of mistakes. > These mistakes were found almost entirely > on pages that were one-and-done, that is, proofed once. > So what is to be done? inform those proofers they are making mistakes, and how, and that they are not doing nearly as well as they think they are. and to put this into perspective, there are about a dozen committed proofreaders taking part in this experiment, with another 5 dozen people contributing fewer pages... so four "bad" proofers constitute about 1/3 of the lot...
in other words, even though "4 proofers" sounds _rare_, the actuality is the percentage of "bad" proofers is high. this fact should _not_ be surprising. when you fail to give people any feedback on their performance, many will think they're doing a fine job, even if they're doing a terrible job. (this is a big problem over at the d.p. mothership, but we probably shouldn't be getting into that can of worms now.) after all, if they didn't think they were doing fine, they would change what they were doing, so they _could_ be doing fine. so you absolutely need to give them good and fast feedback. > One solution is to have every page > looked at by at least two proofers. that seems straightforward, but it has some gotchas. > That seems straightforward but it has some gotchas. right. :+) > If every proofer knows that every page > is going to be looked at by someone else, > will they proof that page differently > than if they intended it to be one-and-done? it's likely. so you'd need to assume it, and work from there. > I think they might. Knowing the underlying mechanism > can undermine the process. well, you must assume people "know the underlying mechanism", because you want to be open and transparent about it with them. there's really no other option when you're working with volunteers. > Also, what if the second proofer is one of the four mentioned earlier? or what if they both were? > There is a good chance that many of the errors would slip through. right. > It's easy for me to change the site code to force two looks at every page, > and I'll probably do that, perhaps even with a project in progress. doesn't matter. even after two forced looks, some errors will remain. > A down side to the "every page looked at by at least two proofers" > approach is specific to fadedpage: that there are only a dozen or so > active proofers of the 60 or so registered users. The double-look > algorithm adds about 35% to the number of page looks on a project. doesn't matter. there's no need for any haste on the books coming out... > A better solution than just a double-look is > to actually instantiate Confidence in Proofer (CiP). i was afraid you were gonna say that. and it's absolutely the wrong approach. > For these four proofers, the system could schedule a second look at > their pages even if they check the "this is done" box when done proofing. > It would give them plenty of diffs to look at, and they would be > expected to look at those diffs that show some correction was made. well, it'd be better just to inform them and educate them in the first place, rather than impose an "expectation" on them that informs them (indirectly) and forces them to educate themselves (again, in a very indirect fashion)... > If diffs were not checked, then their access to new pages > would be reduced. The kind of proofer who checks diffs, > learns, and continues to contribute is exactly what is needed. well, yeah, maybe... but you're assuming a real luxury of an overabundance of volunteers, and a willingness to throw a good number of them away as "not exactly what is needed". it's better to figure out how to find a use for _all_ volunteers. > I believe for a roundless system to work, there has to be > a reliable mechanism for stopping a page as done. d'oh. there has been complete agreement that that is the issue from day 1. > I also believe that to have a reliable way to make that determination, > some form of Confidence in Proofer needs to be in place. some people have held that belief, yes.
i think it's unobtainable, and wrongheaded, and basically a dead end. even if you get a rudimentary version, it won't turn out to be useful... > Therefore, CiP, which is important, and page tweets, which are > useful and fun, are currently my main coding efforts at fadedpage. yeah, well, you'll be coming back sometime down the line and saying "those who wish to say 'i told you so' can chime in now, rightfully"... the thread has more, on confidence-in-proofer, but i'm not gonna waste any more of my time dealing with that flawed concept... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sun Feb 7 11:12:17 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 7 Feb 2010 14:12:17 EST Subject: [gutvol-d] how to make roundlessness work, in one brief post Message-ID: <88b8.2d7d96f9.38a06a91@aol.com> here's how to make a roundless system work, in a nutshell... 1. do aggressive preprocessing. 2. use nonintrusive zen markup. 3. submit the page to a p1 proofer. 4. repeat #3 until no change is made. 5. submit the page to a p1 proofer again. 6. if a change is made, go back to step #3. 7. if there is no change in #5, page is done. if you want it even briefer, do aggressive preprocessing, and then repeat processing through p1, until you obtain 2 consecutive rounds of no change, and the page is done. for greater accuracy, or if you have proofers in abundance, repeat until you get 3 consecutive rounds without change. (but the increased accuracy isn't worth the increased work.) for lesser accuracy, stop after 1 round that sees no change, but the decreased accuracy here is too high a price to pay... aggressive preprocessing is the secret, because most errors can be located automatically, so the pages are clean before they even get to "proofers", who are really "smoothreaders". this, of course, is the same formula i've suggested for years. (once you've hit upon the right answer, no reason to change.) you can easily assure yourselves that this is the right answer; track how many errors persist through 2 rounds of no-change. (versus how many persist through 1 round, 3 rounds, 4, etc.) no need to collect any messy stats. just 2 rounds of no-diff. the time you spend exploring other stuff is just wasted time. just watch. this is the formula that will prove to be the best. and when you get around to admitting it, i'll say "i told you so". -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sun Feb 7 14:13:45 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 7 Feb 2010 17:13:45 EST Subject: [gutvol-d] roundlessness -- 003 Message-ID: we're looking at rfrank's "roundless" experiment at fadedpage.com... *** ok, it's been a while since i went through this little drill, so i'm gonna give you a little refresher-course on where i stand on a workflow that distribution proofreaders could be using, particularly from the viewpoint of rfrank's roundless research. *** first off, i would substantially re-ramp the "preprocessing"... this is the first step, where you're doing the scanning and o.c.r., or fetching the results from one of the major scanning projects. we'll start at the very beginning... *** i've advocated -- strongly -- a filenaming system that indicates, for each file, the contents of the file, so the name is meaningful. specifically, the _page-number_ of the file should be included in a consistent way in the name of a file. 
thus, for instance: > http://z-m-l.com/go/myant/myantp123.jpg the image-file for page #123 in the book "my antonia" is named myantp123.jpg. the "myant" prefix is common to files for the book, and the "p123" part refers to page 123. this is so straightforward that you couldn't be faulted for assuming that it's taken for granted. but it's not. many of the content-providers over at d.p. name their files differently, so the filename is _not_ unambiguously tied to the page-numbered contents inside; and a price is paid for this unclarity. yes, rfrank is one of these ill-advised content-providers, and he has carried over this bad habit to this experiment on roundless proofing. this shortcoming is even worse in a roundless system, because there is a need to refer to specific pages in such a system, and the absence of a sensible naming-scheme therefore becomes a bigger problem... (in a round-based system, where all the pages are treated as a "batch", the problem isn't as bad, but it's still something that should be fixed.) *** next, there are a number of things that are done in "preprocessing", at distributed proofreaders and by rfrank in his roundless experiment. some of these things should _not_ be done, in my considered opinion. on the other hand, there are other things that _should_ be being done. these are some things that are done which should _not_ be done: 1. run-heads and page-numbers are eliminated. 2. end-line hyphenates are being joined. 3. end-line em-dashes are being "clothed". these are some things that _should_ be done, which are not: 1. obvious and easily-located problems should be fixed. 2. spacey-quotes should be fixed. 3. ellipses should be standardized on a 3-dot ellipse. we can engage in debate on all of these suggestions, but it will be instructive for us to see some of the o.c.r. errors i'm talking about. appended to this post is a list of bad words or lines that were pulled from a book currently being proofed at rfrank's roundless experiment. these are almost all errors that will certainly need to be fixed. aggressive preprocessing can find these errors without looking for them. that's very important, in a roundless system, because if you can find and fix all the errors _before_ a page is subjected to a word-by-word review, then that word-by-word review can become the first "no-diff" in a chain of "no-diff" reviews, and the shorter that chain is, the more efficient you are. in the other scenario, if the page is dirty, you might have to have 1 or 2 (or even 3) proofings before the page is clean enough to receive a no-diff, meaning your efficiency has plummeted. there's no reason to make a human _search_ for errors when those errors can be located quickly and efficiently by a computerized search routine... -bowerbird p.s. sorry i've been sluggish with this series. i thought i'd been away from this stuff long enough that it wouldn't bore me to do it all again, but so far it has been a drag... i can only repeat this so many times... i'll try to get the motivation back again, but no promises if i cannot... p.p.s. here's that list of "probable errors" pulled out of rfrank's book: of'more Fd say pver t'other Enew a'slipped pn curtiss-robih somethin'to ght asighin' buncft whick mother'Ship ground^igood hefteaw iny wonderf ^Devolutions punkins outen tpreviously bustl gun'ls thaft ag^ny sud-v denly ij J>ack jumpin'his blame'em Jiack apture twise Weil numba chorteled pvounded Wowl oaly ^{w}Oh! 
haulded uae vre knovt apeak gink givin* stretch.'* J'ack wheen clost you'l althought eyefull weuns I" oa pHot etchin' jest'magine iframe ha.nd you'fe valk a'been outen morc'n MCGrath tc Unc' wuss'n pizen fresfi hirn orr hinsel| But you never can tell just what may hap]7^{en sion of weird noises springing up from the g<9^{a}^ 74-* /' 161 17S "Working over a bird with red feathers,'^{1 fall for such a decent game 93 taxidecentry or 18? was the work of a few second3. Hardly had the THE COMEBACK 227 HUNT OF THE S-18 belonged to the Hun pHot, Oscar Gleeb. must be pesU, at Jn' you like all get-out, so I made cruel to keep me a'guessing any longCi. so anxious to get started oA their way Porter Press disappears from an airpnQp .Watch my smoke, that's all." .until finally it died away completely. This gave .keep your eyes fastened on him. Whatever . A fortune hangs in the balance when young Dan Tierney, press , "Gripes! that was worth somthin'to glimpse, ; for Perk valued a few words of praise ;whooped the delighted Perk as he squatted pn you to hold my own. That's j^t how it should say,^{r} she bust out o' that little fog cloud right such as would tell ^ business being put through rendezvous and^ it's our game to chase after them, light, whether ^{r}'and or gulf, the chances were holding ground^igood--a heavily laden sailing path, and going through the most wonderf ^Devolutions. and be ready to pounce down on their inten^{n} a fat duck that had been selected out of the flc^{c}k One thing he did do was to cut his intend^ wide circle short and again head toward ^{ne }scene of action, a move that certainly afford^ the eager Perk more or less satisfaction, he bel^{n} thrilled with the expectation of breaking into th^{e }game without much more loss of time. But you never can tell just what may hap]7^{en }when rival forces are striving against one ^^{n}' other. The best laid plans often go wrong a^{n}^ there was always a chance of the unexpected happening. Hardly had the airship whipped around ag^^{n }so as to head into the north than Perk beca<"^{e }aware of the fact that there was a sudden acc^{es}' sion of weird noises springing up from the g<9^{a}^ toward which they were now aiming. Jack, t^ must have caught the increased volume, for ^{ne }sheered off as if to hold back a bit so as to gr^^{s}P the meaning of the new racket. Men were no longer simply talking or laug^h" ing as they so cheerfully labored in transferring some of the contraband from the sloop to ^{ne }deck of the speedboat--their voices were rai^^{e}^ to shouts in which surprise, even the element ^ the frenzied sufferers in their ag^ny had been ^{w}Oh! That can be put through without muck said that name exactly three tin^s, like it meant operator as Oswald Kearns pick oui^an ordinary would rather have Jack praise him than ^ny one ^{r}fully five feet long and as thick through the body him, no matter where he goes-^-sorter dude, I'd will y^{u}> boy--two--three fellers jest swarmed "Working over a bird with red feathers,'^{1?? M^ans our gent has a raft o' ships comin' an' through fire or some similar means of destruc^ our man ditto. Mebbe now I'll soon^{x}get a chance "Gosh, amighty, we're flyin'\Mgh, buddy!" -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jimad at msn.com Mon Feb 8 14:30:18 2010 From: jimad at msn.com (Jim Adcock) Date: Mon, 8 Feb 2010 14:30:18 -0800 Subject: [gutvol-d] Re: Formats and gripes In-Reply-To: <4B6923EA.6050700@perathoner.de> References: <20100124200541.GH27785@pglaf.org> <4B60A4D4.8050701@perathoner.de> <4B613AB1.4020302@perathoner.de> <4B61D55B.8010405@perathoner.de> <4B65299A.7060304@perathoner.de> <4B6731F7.8050503@perathoner.de> <4B67421F.7090500@perathoner.de> <4B683B58.90800@perathoner.de> <4B6923EA.6050700@perathoner.de> Message-ID: >But YOU are perfectly free to volunteer your time to save Amazon some bucks: Take my epubs, patch them, and convert them to mobis that display the toc when you hit the toc button, and redistribute them on your site. Again, the OPF file format specifies that TOC and NCX are separate things, intended for separate purposes. ADE screws the pooch on this one, providing one interface for both the TOC and NCX. Given that ADE screws the pooch, you and many other epub implementers make a pragmatic decision to target ADE for your generated epubs, generate an NCX, and leave out the TOC. Kindle expects to see both TOC and NCX and uses them for distinct purposes -- as OPF specifies! If Kindle were to emulate ADE's mistakes, then there could also not be a separate TOC and NCX on Kindle, fulfilling their separate purposes as specified in the OPF specification. It is not a question of wasting your time or wasting my time but rather wasting PG "customers'" time because what PG is providing today to customers is broken. It would be less broken if there weren't a lot of extra PG verbiage at the start of books making it harder for customers to find the embedded HTML TOC which would allow the customers to navigate to that which they want to read. But it would be even better if customers could push the "TOC" button on their machine and have a TOC actually displayed -- as happens with "real" e-books. Further, the whitewashers typically require an additional "real" TOC to be implemented in the HTML, which then is also not actually being used, resulting in additional wasted time and energy on the part of the volunteers. PG, for example, could simply adopt a convention that a TOC.html be shipped with a submitted HTML, linked into that doc, and then you could link to that TOC.html, and the time and effort that whitewashers are asking of submitters then would not be wasted. It would be fine if you didn't want to do it right, if PG would allow submission of generated EPUBs and MOBIs so submitters could choose to do it right and not have their time and efforts wasted. But, you don't allow that to happen either! Again, all that these policies today accomplish is that customers get frustrated with PG, take PG books apart and rebuild them properly -- taking off the PG name and verbiage in the process, and redistribute them on other sites. I just think it would be nice if PG volunteers get recognized for the time and effort they contribute by having people actually read "PG" books, as opposed to rebranded ex-PG books where the name has been taken off so customers don't even realize they are reading something which is 99% PG efforts.
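To make this concrete, here are the two separate hooks in a 2.0 .opf -- a minimal fragment, with hypothetical file names and the manifest omitted:

    <spine toc="ncx">          <!-- machine navigation: points at the NCX -->
      <itemref idref="toc-page"/>
      <itemref idref="body"/>
    </spine>
    <guide>
      <!-- the human-readable TOC page: this is what the TOC
           button on a Kindle jumps to -->
      <reference type="toc" title="Table of Contents" href="toc.html"/>
    </guide>

ADE only ever honors the first hook; Kindle wants both, just like the spec says.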
From jimad at msn.com Tue Feb 9 11:39:39 2010 From: jimad at msn.com (Jim Adcock) Date: Tue, 9 Feb 2010 11:39:39 -0800 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: <8005.73d837a3.38a06575@aol.com> References: <8005.73d837a3.38a06575@aol.com> Message-ID: > If every proofer knows that every page > is going to be looked at by someone else, > will they proof that page differently > than if they intended it to be one-and-done? Under the *current* DP system everyone knows that everything being done is also going to be worked on by about six other people. The hard part then is getting anyone to feel "ownership" about anything -- particularly about getting something *done*. Automatic scoring of proofing efforts, and automatic reporting back of scannos that slip by that other people find -- without making a "big deal value judgement" about those that slip by -- might make a positive contribution. Getting more people who care to read the finished or almost finished product, and providing an easy and convenient way to give feedback on bugs found, or, god forbid, to be able to actually fix those bugs directly, might also make a contribution. From jimad at msn.com Tue Feb 9 11:48:15 2010 From: jimad at msn.com (Jim Adcock) Date: Tue, 9 Feb 2010 11:48:15 -0800 Subject: [gutvol-d] Re: how to make roundlessness work, in one brief post In-Reply-To: <88b8.2d7d96f9.38a06a91@aol.com> References: <88b8.2d7d96f9.38a06a91@aol.com> Message-ID: >aggressive preprocessing is the secret, because most errors can be located automatically, so the pages are clean before they even get to "proofers", who are really "smoothreaders". Agreed with this part at least -- many motivated "early readers" love a particular author, and would be happy to get early access to the text via some kind of tool that allowed them to fix or at least mark the bugs they find as a part of their reading. "Marking" bugs as a part of reading could be as simple as asking them to read on a notepad or what have you and put a Q-mark in the text where they think they see a bug. Then diff their back submission to find the bugs that need to be fixed. Readers of e-books could even back-submit a "bookmarks" file that tags where errors were seen allowing proofing to be done on any e-book reader. From klofstrom at gmail.com Tue Feb 9 11:56:59 2010 From: klofstrom at gmail.com (Karen Lofstrom) Date: Tue, 9 Feb 2010 09:56:59 -1000 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> Message-ID: <1e8e65081002091156p2bc4c1eco7d229047823a8e46@mail.gmail.com> On Tue, Feb 9, 2010 at 9:39 AM, Jim Adcock wrote: > Under the *current* DP system everyone knows that everything being done is also going to be worked on by about six other people. The hard part then is getting anyone to feel "ownership" about anything -- particularly about getting something *done*. Jim, this is unfair to DP and to those of us who work there. I'm a high-count proofer in P3. I do care about finishing off books ... indeed, I'm a member of P3 Archers, a team that works to "shoot down" books that are almost-but-not-quite finished (we completed 27 projects last week). I did my share of slogging on the Baburnama, a nightmare project with lots of diacritic-spattered Turki, as well as other mouldie oldies. I also care about the quality of my work. I can't be sure that a formatter or a PPer is going to catch an error if I miss it in P3. I spellcheck and if I'm not sure of a word, look it up in OneLook online dictionary.
I'm not sure that the current system at DP is the best possible, but I also know that various groups are experimenting with other workflows. It's a Rube Goldberg contraption in some ways, but it does keep putting out the books: more than 17,000 at last count. -- Karen Lofstrom From jimad at msn.com Tue Feb 9 13:27:23 2010 From: jimad at msn.com (James Adcock) Date: Tue, 9 Feb 2010 13:27:23 -0800 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: <1e8e65081002091156p2bc4c1eco7d229047823a8e46@mail.gmail.com> References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002091156p2bc4c1eco7d229047823a8e46@mail.gmail.com> Message-ID: >> Under the *current* DP system everyone knows that everything being done is also going to be worked on by about six other people. The hard part then is getting anyone to feel "ownership" about anything -- particularly about getting something *done*. >Jim, this is unfair to DP and to those of us who work there. I'm a high-count proofer in P3. I do care about finishing off books ... I have two books, highly requested, in DP on which I spent about 40 hours each getting them into DP, and where they have been moldering for almost a year now. They are "stuck" and there is no way to get them unstuck, and the txt has been "ready to go" from almost the beginning. Again, the txt part, including P1, P2, P3, is the easy part of the problem, and is working relatively well compared to the rest of the DP process. Compare this, for example, with the fact that I can personally crank out a book -- perhaps not quite as good as DP's -- taking about the same 40 hours *total*, and can get it done including HTML in less than a month elapsed time, including god knows how many family emergencies intruding on my efforts. I *try* to take ownership of these books at DP but am prevented from doing so by the system and the management -- god knows if I were allowed to do so I would personally have finished them off a half a year ago! A fundamental part of the DP problem is that the "design" (if you want to call it that) of the queuing system doesn't work. Another part of the problem, frankly, is the disproportionate amount of time spent on books that are very complicated, poorly scanned, and not very good choices to begin with -- meaning simply that they are books when all is said and done that not that many people are going to want to read. Under the current system bad ideas are allowed to consume a disproportionate amount of everyone's time and effort -- but isn't that true of life in general! From ajhaines at shaw.ca Tue Feb 9 13:40:30 2010 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Tue, 9 Feb 2010 13:40:30 -0800 Subject: [gutvol-d] Re: rfrank reports in References: <8005.73d837a3.38a06575@aol.com> Message-ID: <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> Re: >Getting more people who care to read the finished or almost finished >product, >and providing an easy and convenient way to give feedback on bugs found, or, DP-US and DP-Canada both have a smooth-read facility, with instructions on how to report problems. >god forbid, to be able to actually fix those bugs directly, might also make a > contribution. Allowing the hoi polloi, as it were, to "fix bugs" is a sure-fire way of introducing errors. I occasionally have to disallow an errata-reported error because the reporter wasn't aware that a word was, in fact, valid. For example, "ancle" is a valid, albeit archaic, variant of "ankle", and is not an error. But, if it's a typo/scanno for "uncle", it is.
I've also handled reported errors where the error was real, but the suggested correction was incorrect. Al ----- Original Message ----- From: "Jim Adcock" To: "'Project Gutenberg Volunteer Discussion'" Sent: Tuesday, February 09, 2010 11:39 AM Subject: [gutvol-d] Re: rfrank reports in >> If every proofer knows that every page >> is going to be looked at by someone else, >> will they proof that page differently >> than if they intended it to be one-and-done? > > Under the *current* DP system everyone knows that everything being done is > also going to be worked on by about six other people. The hard part then > is > getting anyone to feel "ownership" about anything -- particularly about > getting something *done*. > > Automatic scoring of proofing efforts, and automatic reporting back of > scannos that slip by that other people find -- without making a "big deal > value judgement" about those that slip by -- might make a positive > contribution. > > Getting more people who care to read the finished or almost finished > product, > and providing an easy and convenient way to give feedback on bugs found, > or, > god forbid, to be able to actually fix those bugs directly, might also make > a > contribution. > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From jimad at msn.com Tue Feb 9 14:04:06 2010 From: jimad at msn.com (James Adcock) Date: Tue, 9 Feb 2010 14:04:06 -0800 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> Message-ID: >DP-US and DP-Canada both have a smooth-read facility, with instructions on how to report problems. Below describes how one might get started doing smooth reading if anyone cares to see in part what the problem might be: http://www.pgdp.net/wiki/Smooth-reading_FAQ If there were a tie-in between PG and DP to let people in general know when SR is happening, DP might get more SRs. A list of books "on deck" if you will and how to get them. >Allowing the hoi polloi, as it were, to "fix bugs" is a sure-fire way of introducing errors. I occasionally have to disallow an errata-reported error because the reporter wasn't aware that a word was, in fact, valid. For example, "ancle" is a valid, albeit archaic, variant of "ankle", and is not an error. But, if it's a typo/scanno for "uncle", it is. I've also handled reported errors where the error was real, but the suggested correction was incorrect. This would be a problem that DP already has because in my experience many a P3 "knows" so well how to do their job that they never bother to double-check what it is the author actually wrote or that which the publisher actually published -- which in practice turns them into gold-plated SRs. Don't get me wrong, DP has many excellent dedicated people at all levels, including all levels of P1, P2, P3 -- it's just that moving up the ranks doesn't necessarily mean people are actually getting any better at what they are doing. And the queuing system guarantees that the upper level "experts" are going to be overloaded.
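P.S. on the "ancle" point -- agreed that raw reports can't just be applied. Any direct-fix mechanism would at minimum have to triage reports before they touch the text. A hypothetical sketch (the word lists here are tiny samples standing in for real ones):

    ARCHAIC = {"ancle", "shew", "to-day", "gaol"}        # sample archaic variants
    MODERN = {"ankle", "uncle", "show", "today", "jail"} # stand-in for a dictionary

    def triage(reported_word):
        # never auto-apply -- classify the report for a human
        if reported_word in ARCHAIC:
            return "valid archaic form: verify against the page image"
        if reported_word in MODERN:
            return "valid modern word: possibly not an error at all"
        return "unknown word: compare against the scan before changing"

That doesn't replace anyone's judgment, it just keeps the obvious false alarms out of the queue.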
From klofstrom at gmail.com Tue Feb 9 14:57:25 2010 From: klofstrom at gmail.com (Karen Lofstrom) Date: Tue, 9 Feb 2010 12:57:25 -1000 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> Message-ID: <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> On Tue, Feb 9, 2010 at 12:04 PM, James Adcock wrote: > This would be a problem that DP already has because in my experience many a P3 "knows" so well how to do their job that they never bother to double-check what it is the author actually wrote or that which the publisher actually published -- which in practice turns them into gold plated SRs. And how would you know this? Long experience as a formatter or PPer? -- Karen Lofstrom From prosfilaes at gmail.com Tue Feb 9 15:01:44 2010 From: prosfilaes at gmail.com (David Starner) Date: Tue, 9 Feb 2010 18:01:44 -0500 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002091156p2bc4c1eco7d229047823a8e46@mail.gmail.com> Message-ID: <6d99d1fd1002091501v33bfdcc5q699d1af4ea3590fc@mail.gmail.com> On Tue, Feb 9, 2010 at 4:27 PM, James Adcock wrote: >?Again, the txt part, > including P1, P2, P3 is the easy part of the problem, and is working > relatively well compared to the rest of the DP process. Since when has it been okay to toss out italics, indentation of poetry and proper footnotes in the text file? -- Kie ekzistas vivo, ekzistas espero. From Bowerbird at aol.com Tue Feb 9 15:56:44 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 9 Feb 2010 18:56:44 EST Subject: [gutvol-d] Re: rfrank reports in Message-ID: <19a40.fc4738.38a3503c@aol.com> jim said: > Under the *current* DP system i'm steering clear of discussing the d.p. system right now; there's so much cruft over there it's not worth the trouble. i'm either discussing rfrank's experiment, or my own system. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Tue Feb 9 16:08:33 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 9 Feb 2010 19:08:33 EST Subject: [gutvol-d] Re: how to make roundlessness work, in one brief post Message-ID: <19fdb.2d582660.38a35301@aol.com> jim said: > many motivated "early readers" love a particular author, > and would be happy to get early access to the text via > some kind of tool that allowed them to fix or at least mark > the bugs they find as a part of their reading.? "Marking" bugs > as a part of reading could be as simple as asking them to > read on a notepad or what have you and put a > Q-mark in the text where they think they see a bug. while you seem to be talking about smoothreading here, the text you quoted from me was about preprocessing... preprocessing happens before the text goes to any proofer -- it's scheduled immediately after o.c.r. has been done -- and it doesn't require reading of _any_ kind at all, which is why it is about fourteen times more efficient than proofing. a preprocessing tool finds glitches that are almost certainly errors, and takes you to them directly in the text-file while displaying the appropriate scan for referral, and often even gives you buttons that will perform the desired correction... some glitches (like spacey quotes) can even be auto-fixed. 
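(a minimal sketch of the kind of preprocessing check described above, assuming raw o.c.r. text; the glitch patterns and the auto-fix are illustrative guesses, not dkretz's actual "twisted" tool.)

    import re

    # glitch patterns that are almost certainly o.c.r. errors; each is a
    # guess at the kind of rule such a tool would carry
    PATTERNS = [
        (r'\s"\s', 'spacey quote'),            # floating double-quote
        (r'\btbe\b', 'scanno: tbe -> the'),    # classic o.c.r. misread
        (r'\barid\b', 'possible scanno: arid -> and'),
        (r'[a-z],[a-z]', 'missing space after comma'),
    ]

    def find_glitches(text):
        # point straight at likely errors instead of re-reading the book
        hits = []
        for lineno, line in enumerate(text.splitlines(), 1):
            for pat, label in PATTERNS:
                if re.search(pat, line):
                    hits.append((lineno, label, line.strip()))
        return hits

    def autofix_spacey_quotes(line):
        # snap a floating quote onto the following word, so that
        # `he said, " hello` becomes `he said, "hello`
        return re.sub(r'(\s)"\s+(\w)', r'\1"\2', line)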
From ke at gnu.franken.de Tue Feb 9 19:59:20 2010 From: ke at gnu.franken.de (Karl Eichwalder) Date: Wed, 10 Feb 2010 04:59:20 +0100 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: (James Adcock's message of "Tue, 9 Feb 2010 13:27:23 -0800") References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002091156p2bc4c1eco7d229047823a8e46@mail.gmail.com> Message-ID:

"James Adcock" writes:

> many family emergencies intruding on my efforts. I *try* to take ownership of these books at DP but am prevented from doing so by the system and the management -- god knows if I were allowed to do so I would personally have finished them off half a year ago! A fundamental part of the DP problem is that the "design" (if you want to call it that) of the queuing system doesn't work.

I also consider this a serious defect. IMO, it must be possible, if someone wants to work on a book, to "activate" it (= unlock it from a waiting state).

> Another part of the problem, frankly, is the disproportionate amount of time spent on books that are very complicated, poorly scanned, and not very good choices to begin with -- meaning simply that, when all is said and done, they are books that not many people are going to want to read. Under the current system bad ideas are allowed to consume a disproportionate amount of everyone's time and effort --

I'm always wondering why people work on books they are not interested in...

> but isn't that true of life in general!

Probably ;)

-- Karl Eichwalder

From jimad at msn.com Tue Feb 9 20:59:30 2010 From: jimad at msn.com (Jim Adcock) Date: Tue, 9 Feb 2010 20:59:30 -0800 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002091156p2bc4c1eco7d229047823a8e46@mail.gmail.com> Message-ID:

>I'm always wondering why people work on books they are not interested in...

Because the "good books" don't get released from the queue until these other ones get finished, and because many people who volunteer for DP are incredibly open-hearted.

From jimad at msn.com Tue Feb 9 21:12:22 2010 From: jimad at msn.com (Jim Adcock) Date: Tue, 9 Feb 2010 21:12:22 -0800 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> Message-ID:

>And how would you know this? Long experience as a formatter or PPer?

It's not hard to see, actually, when a P3 or others make changes which don't match the page images. You just have to actually look at one and then the other. I've seen many great proofers at all of the P1, P2, and P3 levels. I have also seen "well reputed" P3's who turn out results that don't match the page images. The text they create scans perfectly fine; it's just that it's not what the author wrote -- particularly when it comes to punctuation. The best way to get great results is to have people working on a text and an author they absolutely love, not just cranking out the numbers.
From jimad at msn.com Tue Feb 9 22:59:46 2010 From: jimad at msn.com (Jim Adcock) Date: Tue, 9 Feb 2010 22:59:46 -0800 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> Message-ID:

>>And how would you know this? Long experience as a formatter or PPer?

>It's not hard to see, actually, when a P3 or others make changes which don't match the page images.

I just checked my previous claims about the problems with P3 (for example) against a pretty straightforward text. I reviewed 200 pages from that text. P3's made 38 changes on those pages. Of these changes, 7 represented a positive contribution towards making the txt correct. Of those positive changes, about half could easily be found by a simple tool like guiguts. 10 of the changes introduced by the P3's were negative changes -- changes that moved the text to a less perfect state. The remaining 21 changes were basically "null changes" relating to established DP procedure, which made the txt neither any better nor any worse. Most of the negative changes related to punctuation, as I previously claimed. Again, it's really hard for the human mind to accept that the best thing to do when things aren't broken is to leave them alone -- people really want to make a "positive contribution" by changing things.

By my calculation DP is cranking out an average of 194 books a month -- which is impressive. But consider some of the upper-level queue times:

2000 books stuck in P3 = 10.3 months stuck in P3
2840 books stuck in F2 = 14.6 months stuck in F2
2562 books stuck in PP = 13.2 months stuck in PP

Total: 38 months, about 3 years waiting on these higher-level queues, which means it takes about three and a half years in total for a book to get through DP nowadays? -- and getting longer every day.

It seems "pretty obvious" to me, looking at the DP "red bar" graph at http://www.pgdp.net/c/activity_hub.php, that the P3, F2, and PP efforts are "out of control." Which doesn't mean that one should admonish the troops to do better. Rather, it means that the process needs to be redesigned to fit the resources actually available -- somehow you have to move more people into the roles currently labeled "P3, F2, and PP", or you have to redesign things to make their jobs MUCH faster and easier, or you have to redesign the process, or redesign the goals of the organization. I'm not saying this is good or this is bad -- I'm just saying that this is obvious! You cannot indefinitely run an organization that takes more orders in the front door than you ship out the back door -- no matter how big-hearted you are.
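(Jim's queue arithmetic, spelled out: a Little's-law style estimate in which the expected wait in a stage is its backlog divided by the posting throughput. The figures are his; the script only restates them.)

    # assumes steady state: wait = queue length / posting rate
    throughput = 194.0                 # books posted per month, by Jim's count
    queues = {"P3": 2000, "F2": 2840, "PP": 2562}

    total = 0.0
    for name in ("P3", "F2", "PP"):
        months = queues[name] / throughput
        total += months
        print("%s: %4.1f months" % (name, months))
    print("total: %.0f months across the upper queues" % total)
    # prints P3: 10.3, F2: 14.6, PP: 13.2 -- about 38 months if the
    # stages are traversed serially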
From klofstrom at gmail.com Tue Feb 9 23:25:55 2010 From: klofstrom at gmail.com (Karen Lofstrom) Date: Tue, 9 Feb 2010 21:25:55 -1000 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> Message-ID: <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>

On Tue, Feb 9, 2010 at 8:59 PM, Jim Adcock wrote:
> I just checked my previous claims about the problems with P3 (for example) against a pretty straightforward text. I reviewed 200 pages from that text. P3's made 38 changes on those pages. Of these changes, 7 represented a positive contribution towards making the txt correct. Of those positive changes, about half could easily be found by a simple tool like guiguts. 10 of the changes introduced by the P3's were negative changes -- changes that moved the text to a less perfect state.

I'd have to look at them before trusting you on this, as you seem to have an extremely negative, fault-finding attitude towards DP. I wonder if you'd count my occasional bracketed comments, such as [**P3--seems to be a mistake in the original; s/b ;], as errors.

> The remaining 21 changes were basically "null changes" relating to established DP procedure, which made the txt neither any better nor any worse.

Nonetheless, they were useful to the formatters and PPers in making the text predictable.

> ... which means it takes about three and a half years in total for a book to get through DP nowadays? -- and getting longer every day

None of us likes that! Yes, the current round system is broken. It produces better texts than the old 2-round system did. Some of the second-round proofers in those days wanted page count and didn't give a #$%@$#% about accuracy. The results were as dismal as you would expect. However, we're now producing very good texts at an enormous cost. We're discussing further changes. It doesn't particularly help, when one is drowning and flailing about for a handhold, to have a bystander jumping up and down, shouting, "You're drowning, you idiot!"

-- Karen Lofstrom

From Bowerbird at aol.com Wed Feb 10 14:02:30 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 10 Feb 2010 17:02:30 EST Subject: [gutvol-d] Re: rfrank reports in Message-ID: <134f7.6b712356.38a486f6@aol.com>

jim said:
> I just checked my previous claims about the problems with P3 (for example) against a pretty straightforward text. I reviewed 200 pages from that text. P3's made 38 changes on those pages. Of these changes, 7 represented a positive contribution towards making the txt correct. Of those positive changes, about half could easily be found by a simple tool like guiguts. 10 of the changes introduced by the P3's were negative changes -- changes that moved the text to a less perfect state.

jim, and others, if you're going to continue to discuss the problems in the current system at distributed proofreaders, please do it in your own thread, and not in my threads, ok?

as for your findings, jim, it helps to report the actual lines. but it's probably not necessary. in the past, i have documented the same things you report, in great detail, in book after book after book, with evidence. in comparison, your anecdotal reports are relatively flimsy. i'm not saying you're wrong. indeed, you're absolutely correct. i'm just saying your reports are not going to convince anyone. heck, there are people here who refused to believe what i said, in spite of the fact i piled up enough evidence to choke a horse. (nor did i _create_ the evidence; i used data taken directly from various experiments, performed over at d.p. by other people... the truth is out there, and easy to find, if you just care to look. this is what is so silly about all these "experiments". i've told everyone here the simple correct answers, so all that's needed is to test these simple hypotheses and see they _are_ correct. but instead people are testing overly complicated stuff in ways that are not definitive, leading them to become more confused.)

***

perhaps the most impressive findings of my results were these:
1. the best way to know a page is "finished" is when proofers stop making changes to it... up to that point, it's not finished! it's the "best" way to know because a no-diff is easy to measure.

2. 3 rounds of p1 were as effective as a series of p1-p2-p3. (simple solution to queue problems? run the text through p1 until every page comes out repeatedly as a no-diff page.)

3. the third round of p1 found as many additional errors as the p3 round, but _neither_ route found _all_ the errors. the p1(3) proofers found errors that the p3 proofers missed, and the p3 proofers found errors the p1(3) proofers missed, plus there were other errors that both p1(3) and p3 missed. (takeaway: the p1 proofers are not inferior to the p3 proofers, and a "make the page better" philosophy will eventually work to "create a perfect page", without all the attendant pressure.)

4. "parallel" p1 was _not_ useful at turning up any more errors, but it might have value to determine that a page is "done", although more research would need to be done to test that hypothesis...

-bowerbird
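(a sketch of the no-diff stopping rule in findings 1 and 2, assuming a proof() callback that returns the page as the next p1 proofer leaves it; the names are hypothetical, not d.p. code.)

    def proof_until_stable(page_text, proof, max_rounds=6):
        # re-queue the page through p1 until a round changes nothing;
        # a no-diff round is the easy-to-measure signal that it's done
        for round_no in range(1, max_rounds + 1):
            revised = proof(page_text)
            if revised == page_text:
                return page_text, round_no    # no-diff: call it finished
            page_text = revised               # still changing: not done yet
        return page_text, max_rounds          # cap hit: flag for review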
From jimad at msn.com Wed Feb 10 17:57:31 2010 From: jimad at msn.com (Jim Adcock) Date: Wed, 10 Feb 2010 17:57:31 -0800 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> Message-ID:

>However, we're now producing very good texts at an enormous cost. We're discussing further changes. It doesn't particularly help, when one is drowning and flailing about for a handhold, to have a bystander jumping up and down, shouting, "You're drowning, you idiot!"

If I'm a bystander, it is because the texts that I have submitted to DP, which I thought I would be working on, have been frozen in the queues by DP for the last year. What you suggest instead is that I also jump into the pool and start flailing around with you. Been there, done that, got tired of it, climbed back out of the pool. Flailing harder or faster, or throwing more people into the pool, really isn't going to help.

I have made what I consider many positive suggestions, all of which simply invoke anger and defensiveness on the part of DP'ers:

One of which is to post the text after P3 rather than waiting to finish PP. This would make about an additional 4,000 texts available on PG. If one counts volunteer hours as worth $10/hr, this represents an "unfinished inventory" of about $2,000,000. If you value PG downloads at Amazon's minimum price of $1 a book, then these 4,000 texts would generate about $150,000 a year in additional value to society.

Other obvious suggestions would be to adjust your "experience" thresholds and testing methods for admittance to P3, F2, and PP, in order to allow a few more people into these areas and see how much it *really* hurts your quality and productivity -- or not! Fundamentally it is the unbalanced number of people allowed into the upper rounds (or rather not allowed into the higher rounds) which is killing you. Further, any tools that you can offer P3, F2, or PP to make their lives easier would help you greatly.

Another suggestion I have made is to do what many other commercial digitizers of text using human beings do: run two humans in parallel on the same text and then diff the results. If you get a diff on some page, run a third person and vote the results. If you were to double up on the P1 and P2 efforts like this, that would help the P3 queue. If you doubled up the F1 efforts, that would help the F2 queue. I don't know how to help the PP queue, except that I don't understand why you allow almost-finished texts to be stuck moldering on the hard drive of one PP'er for so long. If a PP'er just can't get it done -- take it away and assign it to someone else. It doesn't matter how good or experienced a PP'er is if they just can't get it done.

Another suggestion is to auto-score proofers' and formatters' efforts and automatically assign them to the place in your process where their level of ability will do the most good -- or at least the least damage. It is easy to auto-score the P1, P2, and F1 efforts -- it is basically the ratio of the number of fixes that they make divided by the number of fixes made on the same pages by the successive round. Have the P3s and F2s "retest" on a P2 or F1 round occasionally so that you can auto-score whether they still know what they are doing or not.

Another suggestion would be to update the toolset to make the tools more fun, less time-wasting, and less tweaky. Simple common tasks ought to be simple, painless, and fast. Allowing higher-rez page scans for the people with the bandwidth to handle them would make all the rounds easier.

Another suggestion would be to get PG to allow one to query how many downloads various texts are getting, so that people who are submitting texts to DP which aren't getting read might get some feedback about what their efforts are really accomplishing, or not.

Modifying bowerbird's suggestions slightly, there *are* at least some texts that fit pretty well into template forms, such as some simple novels. Perhaps an automated or semi-automated tool for turning these simpler texts into HTML quickly?

Another obvious suggestion is that there are too many texts in the world to take them all on. Are the readers of PG really interested in "Annals of the Annual Proctology Meeting of 1847"? Is there at least some way to try to discourage really bad ideas? Looking at the actual text of the English-language submissions in P1 right now, it looks to me that about half of them have a reasonable chance of being read. Is there any way to more actively promote the acquisition and prioritizing of texts that are generally recognized as being "better than average", aka "famous" or at least "well known"?

Another obvious suggestion would be to empower PMs to have at least one active project where, if that project gets stuck, they are allowed to take whatever actions necessary to get it unstuck....
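(A sketch of the commercial-style double-keying workflow Jim suggests above, plus his auto-scoring ratio. It assumes page texts held as lists of lines and a third reading to break ties; the names are illustrative, not DP's actual tooling.)

    import difflib

    def merge_parallel(proof_a, proof_b, third_reading):
        # diff two independent proofs of the same page; where they
        # disagree, a third proofer's reading decides
        merged = []
        matcher = difflib.SequenceMatcher(None, proof_a, proof_b)
        for op, a1, a2, b1, b2 in matcher.get_opcodes():
            if op == "equal":
                merged.extend(proof_a[a1:a2])   # both proofers agree
            else:
                merged.extend(third_reading(proof_a[a1:a2], proof_b[b1:b2]))
        return merged

    def proofer_score(own_fixes, fixes_next_round_made):
        # Jim's ratio: fixes you made on a set of pages, divided by the
        # fixes the successive round still had to make on those pages
        return own_fixes / max(1, fixes_next_round_made)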
From prosfilaes at gmail.com Wed Feb 10 19:14:05 2010 From: prosfilaes at gmail.com (David Starner) Date: Wed, 10 Feb 2010 22:14:05 -0500 Subject: [gutvol-d] Re: rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> Message-ID: <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>

On Wed, Feb 10, 2010 at 8:57 PM, Jim Adcock wrote:
> One of which is to post the text after P3 rather than waiting to finish PP.

To which I pointed out that this would in many cases result in the posting of severely deficient texts. Formatting is important.

> I don't know how to help the PP queue, except that I don't understand why you allow almost-finished texts to be stuck moldering on the hard drive of one PP'er for so long. If a PP'er just can't get it done -- take it away and assign it to someone else. It doesn't matter how good or experienced a PP'er is if they just can't get it done.

Because sometimes it may be worth letting a text molder rather than peremptorily ripping it out of someone's hands and annoying the hell out of them.

> Perhaps an automated or semi-automated tool for turning these simpler texts into HTML quickly?

Is guiguts not quick enough for you? This is a fairly simple tool problem.

> Another obvious suggestion is that there are too many texts in the world to take them all on. Are the readers of PG really interested in "Annals of the Annual Proctology Meeting of 1847"?

It's easy to come up with a rhetorically stupid title. But if you pulled a real title, then we could actually discuss the audience and why someone would upload that.

> Is there any way to more actively promote the acquisition and prioritizing of texts that are generally recognized as being "better than average", aka "famous" or at least "well known"?

That presumes that that should be our goal. Some of the works I'm proudest of are works where the PG edition is the best in the world. Sure, more people may read the Canterbury Tales, but everyone who reads our edition of Stephen Hawes's "A Joyful Meditation of the Coronation of King Henry the Eighth" is thrilled that we have it, because the alternative was deciphering the blackletter originals and trying to figure out the lost parts yourself. Augustan Reprint Society works are a large class of works I've done where they have some scholarly interest, but the reader will only find facsimiles outside of PG.

On the other hand is stuff like "1931: A Glance at the Twentieth Century" by Henry Hartshorne. It is none of those things; it's just a fun work to read, even if that fun comes at its own expense. I don't think anyone who worked on it is the least bit unhappy about that.

-- Kie ekzistas vivo, ekzistas espero.

From schultzk at uni-trier.de Thu Feb 11 01:10:26 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Thu, 11 Feb 2010 10:10:26 +0100 Subject: [gutvol-d] Revisting DP In-Reply-To: <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> Message-ID: <975FD7DD-C4F0-49B9-86FB-CC23F727A912@uni-trier.de>

Hi Everybody,

I have been following the thread "rfrank reports in". [Yes, BB, it's been hijacked.] It seems obvious to all that the DP system has severe deficiencies. The question is how to help. Which leads to the question of what is flawed.

It is obvious that the system after P3 is too complex, as is the method of creating the perfect ebook. The result is that there are evidently too few persons who can be trusted with this complexity. The method in general is not the problem, but the rules that have to be abided by!!

I have suggested in the past other alternatives which are by far simpler and would produce the required results. I would implement a system if I had the time; furthermore, it would be a one-person operation. DP has a huge amount of person-power which they could use more efficiently, as we all have noted. But, until the ones who have the say over at DP are willing to simplify their system, the problems will persist.

The formatting/transcription rules required by DP have been developed over time, yet they were evidently added in an ad-hoc manner. Any system should be revamped and streamlined over time. Optimized, if you will. Sure, a few tools would need to be rewritten, but the basic frame should already be there, so that should not pose a great ordeal.

The other questions that remain are: what is a perfect book? Or, what is a predictable book?

regards
Keith.

From Bowerbird at aol.com Thu Feb 11 10:29:41 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 11 Feb 2010 13:29:41 EST Subject: [gutvol-d] Re: Revisting DP Message-ID:

keith said:
> It seems obvious to all that the DP system has severe deficiencies. The question is how to help.

you need to discuss that over at d.p. they don't listen to anything over here. the only reason i discuss things here is because they banned me from there.

-bowerbird

From schultzk at uni-trier.de Thu Feb 11 23:48:56 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Fri, 12 Feb 2010 08:48:56 +0100 Subject: [gutvol-d] Re: Revisting DP In-Reply-To: References: Message-ID: <9A521268-9BC7-4B06-8F38-4AF58E087547@uni-trier.de>

Hi BB,

I think we know the problems DP has; I was just more or less rounding up things that were discussed here. Yet, as you said, ranting here will not help things over there, and ranting there gets you put on the unwanted list, no matter how polite you are.

regards
Keith.

Am 11.02.2010 um 19:29 schrieb Bowerbird at aol.com:
> you need to discuss that over at d.p. they don't listen to anything over here. the only reason i discuss things here is because they banned me from there.
>
> -bowerbird

From jimad at msn.com Fri Feb 12 10:47:08 2010 From: jimad at msn.com (Jim Adcock) Date: Fri, 12 Feb 2010 10:47:08 -0800 Subject: [gutvol-d] DP: was rfrank reports in In-Reply-To: <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> Message-ID:

>To which I pointed out that this would in many cases result in the posting of severely deficient texts. Formatting is important.

OK, but I can also point to texts that were almost "good to go" before they went into DP, only to molder indefinitely there. Is there some way to make a decision on this one way or another? How about letting the PM make the decision whether or not to post a "preliminary version" to PG?

>Because sometimes it may be worth letting a text molder rather than peremptorily ripping it out of someone's hands and annoying the hell out of them.

OK, but how and when do you decide that the PP has actually moved on in life and is not really willing to finish up the book to which others have in good faith contributed their blood, sweat, and tears in the hopes of getting an honest-to-god book? Not to mention the possibility of a PP not working in good faith?

>Is guiguts not quick enough for you? This is a fairly simple tool problem.

Tried it previously and didn't find any value in it. I will take a look at it again.

>It's easy to come up with a rhetorically stupid title. But if you pulled a real title, then we could actually discuss the audience and why someone would upload that.

Pick any title active in the rounds right now. Based on the best statistics I can find on PG usage, which are actually from IA, the most popular books from PG get read literally 100,000 times more often than the least-read books. Now, it is hard to find a book that is going to be that popular. But it is easy to find a good book which will get read literally 40x more often than the books in DP right now, while being at least several times faster and easier to create.

>> Is there any way to more actively promote the acquisition and prioritizing of texts that are generally recognized as being "better than average", aka "famous" or at least "well known"?
>
>That presumes that that should be our goal. Some of the works I'm proudest of are works where the PG edition is the best in the world. Sure, more people may read the Canterbury Tales, but everyone who reads our edition of Stephen Hawes's "A Joyful Meditation....

Is it possible to split the queues and the efforts into "esoterica" vs. "books that will be actively read"? Right now the "books that will be actively read" are, I am afraid, stuck in the queue behind "books that no one is actually willing to work on." I went there recently to try to help, and it looked like "the powers that be" were trying to force through books that really no one wants to work on -- books that were really hard and not very interesting even to the people who volunteer their time to DP. You can't force people to work on things they don't want to work on. Either they work on texts that they want to work on, or, if DP is not willing to present any of those, they go on with their lives, or maybe, like in my case, they "route around damage" and work on books outside of DP.

The problem is NOT that there is "esoterica" vs. "books that will be actively read" -- the problem is that the "esoterica" takes so much time and effort compared to "books that will be actively read" that "esoterica" ends up swamping the other categories.

Are you really saving a book if you pickle it for posterity without it getting read? Isn't that like locking up a ballerina's shoes in order to preserve ballet? Or locking up an artist's paint and brushes in order to preserve art? To my taste, books exist while they are being read. Otherwise they fail to exist -- beyond little magnetic domains stuck somewhere on the internet.

A simple answer would be to put in separate queues for the differing levels of difficulty and/or categories of books. Then people who want to work on esoterica can do so without impacting people who don't.
From prosfilaes at gmail.com Fri Feb 12 11:22:07 2010 From: prosfilaes at gmail.com (David Starner) Date: Fri, 12 Feb 2010 14:22:07 -0500 Subject: [gutvol-d] Re: DP: was rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> Message-ID: <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>

On Fri, Feb 12, 2010 at 1:47 PM, Jim Adcock wrote:
> OK, but how and when do you decide that the PP has actually moved on in life and is not really willing to finish up the book to which others have in good faith contributed their blood, sweat, and tears in the hopes of getting an honest-to-god book? Not to mention the possibility of a PP not working in good faith?

That's not a problem to be solved by ranting; it's a problem to be solved by studying the statistics and talking to the PPers.

> Is it possible to split the queues and the efforts into "esoterica" vs. "books that will be actively read"? Right now the "books that will be actively read" are, I am afraid, stuck in the queue behind "books that no one is actually willing to work on." I went there recently to try to help, and it looked like "the powers that be" were trying to force through books that really no one wants to work on -- books that were really hard and not very interesting even to the people who volunteer their time to DP.

This is the funny thing; there's no connection between books that will be actively read and books people want to work on. What books would be actively read: Euclid, Newton's Principia, the Oxford English Dictionary. We've had scans of the OED for years; no one has been willing to attack it. We can probably come up with a dozen usable scans of Euclid; no one is currently working on getting PG a complete copy of Euclid, because it's a total pain to work on. But take some moldy old historical fiction, or better yet some sci-fi story that hasn't been reprinted since it was first published, and it will rocket through DP.

> The problem is NOT that there is "esoterica" vs. "books that will be actively read" -- the problem is that the "esoterica" takes so much time and effort compared to "books that will be actively read" that "esoterica" ends up swamping the other categories.

Bullshit. How long do you think the OED would take? That's a book that will be actively read. Why did "Dryden's Works (13 of 18): Translations; Pastorals" take two months to go through P2? If you're classifying the complete works of Dryden as esoterica, then what on Earth are you classifying as books that will be actively read? Certainly not the historical trash fiction that does blow through DP.

-- Kie ekzistas vivo, ekzistas espero.
From klofstrom at gmail.com Fri Feb 12 11:27:04 2010 From: klofstrom at gmail.com (Karen Lofstrom) Date: Fri, 12 Feb 2010 09:27:04 -1000 Subject: [gutvol-d] Re: DP: was rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> Message-ID: <1e8e65081002121127r77da6414y32a6f4b00f35d6fc@mail.gmail.com>

On Fri, Feb 12, 2010 at 8:47 AM, Jim Adcock wrote:
> OK, but I can also point to texts that were almost "good to go" before they went into DP, only to molder indefinitely there. Is there some way to make a decision on this one way or another? How about letting the PM make the decision whether or not to post a "preliminary version" to PG?

The poll that's up at DP right now has the respondents just about evenly split on this issue. I would be OK with doing it, but I also understand those who feel that the "preliminary" posting might hang around for years, displacing the final, polished, ACCURATE product.

> Is it possible to split the queues and the efforts into "esoterica" vs. "books that will be actively read"?

No. That's like recommending that publishers solve their financial problems by only printing best-sellers. Some books that YOU think are esoterica might actually be of great interest to a small but appreciative community, such as scholars the world over. Take, for example, the Baburnama, the memoirs of Babur, the Turki conqueror of northern South Asia and founder of the Mughal dynasty, as translated by Beveridge. Fiendishly difficult text, took a year to get through P3, will probably take a lot of time in F1 and F2 and PP, a real slog ... but it's an essential work in South Asian history and I'm sure that it will be of great use to students and scholars once finished. I don't regret the time I spent on it.

> I went there recently to try to help, and it looked like "the powers that be" were trying to force through books that really no one wants to work on -- books that were really hard and not very interesting even to the people who volunteer their time to DP.

There's no forcing going on. The policy from Day One has been that we work on what the content providers submit. Sometimes works that look enticing or valuable to them aren't appealing to the proofers, and then take a long time to wend their way through the system. (Some texts, like Greg Weeks's science fiction stories, zip through in days.) The problem is that the mouldie oldies clog the queues.

There have been quite a few proposals for changing the queue system and the round system, and some experiments are running right now. We'll see what happens. DP made a HUGE change when it moved to five rounds rather than two, and I think it will be able to change again.

-- Karen Lofstrom

From grythumn at gmail.com Fri Feb 12 11:45:39 2010 From: grythumn at gmail.com (Robert Cicconetti) Date: Fri, 12 Feb 2010 14:45:39 -0500 Subject: [gutvol-d] Re: DP: was rfrank reports in In-Reply-To: <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> Message-ID: <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>

On Fri, Feb 12, 2010 at 2:22 PM, David Starner wrote:
> Dictionary. We've had scans of the OED for years; no one has been willing to attack it. We can probably come up with a dozen usable

Not exactly true. I have a clearance on it, and have a fascicle prepped and at DP. The holdup is that I have yet to come up with a good markup for proofing that can be machine-transformed into various dictionary formats. Straight TEI is too big, and likely to lead to inconsistencies. I refuse to start something this big without a decent plan for the final output.

Granted, once started, it will probably take decades to work through DP...

-R C (Who is somewhat easily distracted, and has been working on other projects.)

From ajhaines at shaw.ca Fri Feb 12 12:05:20 2010 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Fri, 12 Feb 2010 12:05:20 -0800 Subject: [gutvol-d] Re: DP: was rfrank reports in References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> Message-ID: <18CC2C23FCF249DEA672595196E236B2@alp2400>

Speaking as a Whitewasher (and probably for the other WWers, too), I have absolutely no interest in posting a "preliminary" version of something if a "revised" version is going to appear in a few days/weeks/months, requiring me to re-do the posting process. Ditto for posting a text-only version if an HTML version is in the works.

My PG priorities are my own productions first, followed by WWing, then Errata and Reposts. My own productions are not going to be allowed to suffer just because someone is in a rush to get a preliminary version out the door. I can always create another priority -- "No Rush Whatsoever".

In short, it's MY time I volunteer to PG, and it's not yours to waste.
Al

----- Original Message ----- From: "Jim Adcock" To: "'Project Gutenberg Volunteer Discussion'" Sent: Friday, February 12, 2010 10:47 AM Subject: [gutvol-d] DP: was rfrank reports in

From greg at durendal.org Fri Feb 12 13:45:15 2010 From: greg at durendal.org (Greg Weeks) Date: Fri, 12 Feb 2010 16:45:15 -0500 (EST) Subject: [gutvol-d] [SPAM] Re: Re: DP: was rfrank reports in In-Reply-To: <18CC2C23FCF249DEA672595196E236B2@alp2400> References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <18CC2C23FCF249DEA672595196E236B2@alp2400> Message-ID:

On Fri, 12 Feb 2010, Al Haines (shaw) wrote:
> Speaking as a Whitewasher (and probably for the other WWers, too), I have absolutely no interest in posting a "preliminary" version of something if a "revised" version is going to appear in a few days/weeks/months, requiring me to re-do the posting process. Ditto for posting a text-only version if an HTML version is in the works.

The proposal isn't to "post" to PG at all, but to something like preprints.readingroo.ms, but entirely automated.

-- Greg Weeks http://durendal.org:8080/greg/

From sly at victoria.tc.ca Fri Feb 12 15:06:24 2010 From: sly at victoria.tc.ca (Andrew Sly) Date: Fri, 12 Feb 2010 15:06:24 -0800 (PST) Subject: [gutvol-d] Re: [SPAM] Re: Re: DP: was rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <18CC2C23FCF249DEA672595196E236B2@alp2400> Message-ID:

I agree that's what has been said in discussions on the DP forums. I would argue that intent was not clear from what's been posted on this list.

From what I've seen, it's hard to stay focused on one concept, because everyone starts dragging in their own concerns on marginally related topics and making those the main focus.

--Andrew

On Fri, 12 Feb 2010, Greg Weeks wrote:
> On Fri, 12 Feb 2010, Al Haines (shaw) wrote:
>> Speaking as a Whitewasher (and probably for the other WWers, too), I have absolutely no interest in posting a "preliminary" version of something if a "revised" version is going to appear in a few days/weeks/months, requiring me to re-do the posting process. Ditto for posting a text-only version if an HTML version is in the works.
>
> The proposal isn't to "post" to PG at all, but to something like preprints.readingroo.ms, but entirely automated.
From dakretz at gmail.com Fri Feb 12 15:33:20 2010 From: dakretz at gmail.com (don kretz) Date: Fri, 12 Feb 2010 15:33:20 -0800 Subject: [gutvol-d] Re: [SPAM] Re: Re: DP: was rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <18CC2C23FCF249DEA672595196E236B2@alp2400> Message-ID: <627d59b81002121533y422adbf4uecf18d0f234922f4@mail.gmail.com>

This is very interesting viewed from both sides. We have one spokesman for PG suggesting that, for the purpose of increasing the rate of growth of the stock using text at some level of markup sophistication (which I seem to remember strongly featured text but not HTML), there was some possibility of room for flexibility (or something like that). There immediately erupted on DP two substantially differing (mis)interpretations of what this might mean. Both of them have received responses I'd characterize as ranging from mostly indifference to revulsion, with a few outliers on both sides.

Now we have a second PG spokesman, and the score here seems to be one vote for "maybe" and another for "hell no". Not much basis left for discussion, but we get a lot of productive venting done on both sides.

From Bowerbird at aol.com Fri Feb 12 16:58:11 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 12 Feb 2010 19:58:11 EST Subject: [gutvol-d] Re: rfrank reports in Message-ID: <1830d.5e4b7dd4.38a75323@aol.com>

what a convoy of clowns...

-bowerbird

From greg at durendal.org Fri Feb 12 17:26:29 2010 From: greg at durendal.org (Greg Weeks) Date: Fri, 12 Feb 2010 20:26:29 -0500 (EST) Subject: [gutvol-d] [SPAM] Re: Re: [SPAM] Re: Re: DP: was rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <18CC2C23FCF249DEA672595196E236B2@alp2400> Message-ID:

Yes, after the initial proposal, there were five or six other proposals made in the same thread to drag it off on odd courses. As far as I can see there's only one person actually doing anything other than argue. That's hanne_dk, and I hope to see an automated script to process DP's intermediate files into something that doesn't look too bad for most texts. It appears it'll never get "official" approval, as there are too many people adamantly against doing it to "their" texts. Oh well.

Greg Weeks

On Fri, 12 Feb 2010, Andrew Sly wrote:
> I agree that's what has been said in discussions on the DP forums. I would argue that intent was not clear from what's been posted on this list.
>
> From what I've seen, it's hard to stay focused on one concept, because everyone starts dragging in their own concerns on marginally related topics and making those the main focus.
>
> --Andrew

-- Greg Weeks http://durendal.org:8080/greg/

From dakretz at gmail.com Fri Feb 12 17:49:41 2010 From: dakretz at gmail.com (don kretz) Date: Fri, 12 Feb 2010 17:49:41 -0800 Subject: [gutvol-d] {Disarmed} Re: [SPAM] Re: Re: [SPAM] Re: Re: DP: was rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <18CC2C23FCF249DEA672595196E236B2@alp2400> Message-ID: <627d59b81002121749y55c6d30bp2f889a7878bb9abb@mail.gmail.com>

Well, to close the circle, if they were posted, they would be (I say this advisedly) *perfect* fodder for, say, an offline utility program to run automated checks and do basic formatting. I bet in most cases one person could whip up a high-quality post-PP equivalent in, say, a day or two.

On Fri, Feb 12, 2010 at 5:26 PM, Greg Weeks wrote:
> Yes, after the initial proposal, there were five or six other proposals made in the same thread to drag it off on odd courses. As far as I can see there's only one person actually doing anything other than argue. That's hanne_dk, and I hope to see an automated script to process DP's intermediate files into something that doesn't look too bad for most texts. It appears it'll never get "official" approval, as there are too many people adamantly against doing it to "their" texts. Oh well.
From ke at gnu.franken.de Fri Feb 12 18:26:24 2010 From: ke at gnu.franken.de (Karl Eichwalder) Date: Sat, 13 Feb 2010 03:26:24 +0100 Subject: [gutvol-d] Using SVN or git/bazar (Re: Re: DP: was rfrank reports in) In-Reply-To: <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> (Robert Cicconetti's message of "Fri, 12 Feb 2010 14:45:39 -0500") References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> Message-ID:

Robert Cicconetti writes:

> Not exactly true. I have a clearance on it, and have a fascicle prepped and at DP. The holdup is that I have yet to come up with a good markup for proofing that can be machine-transformed into various dictionary formats.

Lame excuse ;) The proofing rounds are easy (and you only see the difficulties once you actually let the crowd work on it).

> Straight TEI is too big, and likely to lead to inconsistencies. I refuse to start something this big without a decent plan for the final output.

I'd recommend doing all "formatting" (= XML tagging) off-site. It would probably be best to use SVN or git/Bazaar for collaboration. Any idea where we could host such a repository?

-- Karl Eichwalder

From dakretz at gmail.com Fri Feb 12 18:34:48 2010 From: dakretz at gmail.com (don kretz) Date: Fri, 12 Feb 2010 18:34:48 -0800 Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank reports in) In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> Message-ID: <627d59b81002121834i2edda5a4n9e4efdc5d162afb9@mail.gmail.com>

Google Code

On Fri, Feb 12, 2010 at 6:26 PM, Karl Eichwalder wrote:
> I'd recommend doing all "formatting" (= XML tagging) off-site. It would probably be best to use SVN or git/Bazaar for collaboration. Any idea where we could host such a repository?
URL: From grythumn at gmail.com Fri Feb 12 19:16:21 2010 From: grythumn at gmail.com (Robert Cicconetti) Date: Fri, 12 Feb 2010 22:16:21 -0500 Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank reports in) In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> Message-ID: <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com> On Fri, Feb 12, 2010 at 9:26 PM, Karl Eichwalder wrote: > Robert Cicconetti writes: > >> Not exactly true. I have a clearance on it, and have a fascicle >> prepped and at DP. The holdup is that I have yet to come up with a >> good markup for proofing that can be machine transformed into various >> dictionary formats. > > Lame excuse ;) ?The proofing rounds are easy (and you only see the > difficulties, once you actually let the crowd work on it). Not really. The OED uses a predecessor of IPA with some oddball symbols.. at the least I have to come up with a table for those or they'll be all over the place. I started one, need to finish it. >> Straight TEI is too big, and likely to lead to inconsistencies. I >> refuse to start something this big without a decent plan for the final >> output. > > I'd recommend to do all "formatting" (= XML tagging) off-site. ?It would > probably the best to use SVN or git/bazar for collaboration. ?Any idea > where we could host such a repository? I'm not prepared to abandon the DP workflow, especially for a project of this scale, and considering the amount of markup that will be required. At DP I reasonably assume it'll keep moving, even if I drop off the grid or get hit by a bus. -R C From dakretz at gmail.com Fri Feb 12 20:04:18 2010 From: dakretz at gmail.com (don kretz) Date: Fri, 12 Feb 2010 20:04:18 -0800 Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank reports in) In-Reply-To: <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com> References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com> Message-ID: <627d59b81002122004j376a99fdofbd0f70b0e2df8df@mail.gmail.com> Here's a point of reference for you. The current Encyclop?dia Britannica project in F2 has been there since September. Number of pages: 232 Pages remaining: 24 Pages I've done: 203 Pages other people have done: 5 Some rounds get cherry-picked pretty badly; and OED is not a cherry. Stay away from buses. On Fri, Feb 12, 2010 at 7:16 PM, Robert Cicconetti wrote: > On Fri, Feb 12, 2010 at 9:26 PM, Karl Eichwalder > wrote: > > Robert Cicconetti writes: > > I'm not prepared to abandon the DP workflow, especially for a project > of this scale, and considering the amount of markup that will be > required. At DP I reasonably assume it'll keep moving, even if I drop > off the grid or get hit by a bus. > > -R C > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ke at gnu.franken.de Sat Feb 13 00:17:14 2010 From: ke at gnu.franken.de (Karl Eichwalder) Date: Sat, 13 Feb 2010 09:17:14 +0100 Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank reports in) In-Reply-To: <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com> (Robert Cicconetti's message of "Fri, 12 Feb 2010 22:16:21 -0500") References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com> Message-ID: Robert Cicconetti writes: > Not really. The OED uses a predecessor of IPA with some oddball > symbols.. at the least I have to come up with a table for those or > they'll be all over the place. I started one, need to finish it. You could consider processing it at dp-canada or dp-int--both are UTF-8 enabled. >> I'd recommend to do all "formatting" (= XML tagging) off-site. ?It would >> probably the best to use SVN or git/bazar for collaboration. ?Any idea >> where we could host such a repository? > > I'm not prepared to abandon the DP workflow, especially for a project > of this scale, and considering the amount of markup that will be > required. At DP I reasonably assume it'll keep moving, even if I drop > off the grid or get hit by a bus. That's why I propose to use a public repository. Of course, you would leave a appropriate comment on the project page. Doing TEI tagging page-wise is cumbersome. Doing TEI tagging off-site using your XML editor is much better. -- Karl Eichwalder From grythumn at gmail.com Sat Feb 13 06:39:05 2010 From: grythumn at gmail.com (Robert Cicconetti) Date: Sat, 13 Feb 2010 09:39:05 -0500 Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank reports in) In-Reply-To: <627d59b81002122004j376a99fdofbd0f70b0e2df8df@mail.gmail.com> References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com> <627d59b81002122004j376a99fdofbd0f70b0e2df8df@mail.gmail.com> Message-ID: <15cfa2a51002130639l7fd30080k948fc7998cc3ad72@mail.gmail.com> On Fri, Feb 12, 2010 at 11:04 PM, don kretz wrote: > Here's a point of reference for you. > The current Encyclop?dia Britannica project in F2 has been there since > September. > Number of pages: ? ? ? ? ? ? ? ? 232 > Pages remaining: ? ? ? ? ? ? ? ? ?24 > Pages I've done: ? ? ? ? ? ? ? ? 203 > Pages other people have done: ? ? ?5 > Some rounds get cherry-picked pretty badly; and OED is not a cherry. > Stay away from buses. I might be willing to do a parallel F1 / merge, and automated markup check for F2 skip if I don't have to find a PP in advance. Let me be blunt... I'm easily distracted; doing this kind of markup would drive me nuts quickly and result in orphaned projects. I'll prep the images, run OCR, answer questions, write the code to do the automated checks. But I don't PP or format. 
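One hypothetical shape for the "automated checks" Bob offers to write here: a small script that verifies DP-style inline formatting tags (<i>, <b>, <sc>) open and close in balance on each page file. The tag set, the per-file layout, and the report format are assumptions for illustration, not DP's actual tooling; a minimal sketch in Python:

    import re
    import sys

    # DP formatting uses inline tags such as <i>...</i>, <b>...</b> and
    # <sc>...</sc>; an unbalanced pair on a page is a likely formatting slip.
    TAGS = ("i", "b", "sc")

    def unbalanced(text):
        """Return messages for tags whose open/close counts disagree."""
        problems = []
        for tag in TAGS:
            opens = len(re.findall("<%s>" % tag, text))
            closes = len(re.findall("</%s>" % tag, text))
            if opens != closes:
                problems.append("<%s>: %d open, %d close" % (tag, opens, closes))
        return problems

    if __name__ == "__main__":
        # usage: python tagcheck.py page001.txt page002.txt ...
        for path in sys.argv[1:]:
            with open(path, encoding="utf-8") as f:
                for problem in unbalanced(f.read()):
                    print("%s: %s" % (path, problem))

A check this cheap could run on every page before a round is skipped, which is the point of the "F2 skip" idea.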
From ke at gnu.franken.de Sat Feb 13 09:12:36 2010 From: ke at gnu.franken.de (Karl Eichwalder) Date: Sat, 13 Feb 2010 18:12:36 +0100 Subject: [gutvol-d] Re: Using SVN or git/bazar In-Reply-To: <627d59b81002121834i2edda5a4n9e4efdc5d162afb9@mail.gmail.com> (don kretz's message of "Fri, 12 Feb 2010 18:34:48 -0800") References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> <627d59b81002121834i2edda5a4n9e4efdc5d162afb9@mail.gmail.com> Message-ID: don kretz writes: > Google Code Why not? ;) I just created http://code.google.com/p/tieck-texts/ and seeded it with 'Briefe an Ludwig Tieck (1 of 4) {fraktur} {type-in}'. I'll update the project comments later. Wondering whether Google will accept this project... -- Karl Eichwalder From schultzk at uni-trier.de Sat Feb 13 10:03:24 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Sat, 13 Feb 2010 19:03:24 +0100 Subject: [gutvol-d] Re: DP: was rfrank reports in In-Reply-To: <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> Message-ID: Hi Robert, As far as markup is concerned, I would suggest using TeX or XeTeX. For one, you can encode all the information you want, however you want, with tags such as \entry, \pronunciation, \meaning, \synonym, etc.; you name it. Then either write commands for formatting, or a TeX script to produce the desired output, or use any other language to process the data. Another way to go is to use XML to encode the data and take it from there. Either way you have full control of the input data and output. regards Keith Am 12.02.2010 um 20:45 schrieb Robert Cicconetti: > On Fri, Feb 12, 2010 at 2:22 PM, David Starner wrote: >> Dictionary. We've had scans of the OED for years; no one has been >> willing to attack it. We can probably come up with a dozen usable > > Not exactly true. I have a clearance on it, and have a fascicle > prepped and at DP. The holdup is that I have yet to come up with a > good markup for proofing that can be machine transformed into various > dictionary formats. Straight TEI is too big, and likely to lead to > inconsistencies. I refuse to start something this big without a decent > plan for the final output. > > Granted, once started, it will probably take decades to work through DP...
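To make Keith's suggestion concrete: if entries were keyed in with tags along the lines he names (\entry, \pronunciation, \meaning, \synonym), a short script in "any other language" could collect them into structured records, from which the various dictionary formats could be generated. The braced \tag{value} syntax below is an assumption for illustration, not an agreed convention; a minimal sketch in Python:

    import re

    # Matches the tag names Keith proposes, assuming a \tag{value} syntax.
    TAG = re.compile(r"\\(entry|pronunciation|meaning|synonym)\{([^{}]*)\}")

    def parse_entry(block):
        """Collect the tagged fields of one entry into a dict of lists,
        so repeated fields (e.g. several meanings) keep their order."""
        record = {}
        for tag, value in TAG.findall(block):
            record.setdefault(tag, []).append(value)
        return record

    sample = r"\entry{ash} \pronunciation{ash} \meaning{a forest tree} \meaning{residue of burning}"
    print(parse_entry(sample))
    # {'entry': ['ash'], 'pronunciation': ['ash'],
    #  'meaning': ['a forest tree', 'residue of burning']}

The same records could then be emitted as TeX, XML, or whichever dictionary formats are wanted downstream.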
From dakretz at gmail.com Sat Feb 13 10:23:06 2010 From: dakretz at gmail.com (don kretz) Date: Sat, 13 Feb 2010 10:23:06 -0800 Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank reports in) In-Reply-To: <15cfa2a51002130639l7fd30080k948fc7998cc3ad72@mail.gmail.com> References: <8005.73d837a3.38a06575@aol.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com> <627d59b81002122004j376a99fdofbd0f70b0e2df8df@mail.gmail.com> <15cfa2a51002130639l7fd30080k948fc7998cc3ad72@mail.gmail.com> Message-ID: <627d59b81002131023h6d2ec44r842d268d941d79e0@mail.gmail.com> A little more information. That same project (which is 232 pages, about 1/8 of one volume out of 29 volumes) was being P3 proofread from April to November of 2007 (about 7 months). Then it sat in queues with no one working on it from Nov. 2007 to Sept. 2009 (almost two years) except for a brief spell (3 months) when it was in F1. And that was pretty speedy. A new project (such as the one I'm preparing now, which will be 300+ pages) will not be quite so fortunate, because now the queues are much longer; and more significantly, there will be many more EB volumes ahead of it when it gets to each queue. So I'd be prepared to spend some time proofing at least (if you don't prefer formatting and PP) to help it along in those brief windows of opportunity (roughly 9-12 months) when it's available to anyone at all. (But given well-established trends, it will probably be much longer.) Fortunately, you'll have lots of time to scan and OCR each project. In fact, I bet you'll be so fortunate as to have a new generation of scanning technology available every couple of projects or so. It may easily take longer to proof, format, and publish the ebook than it took for the original - an acknowledged epic in itself. For sure, it could be re-typeset in a small fraction of the time. On Sat, Feb 13, 2010 at 6:39 AM, Robert Cicconetti wrote: > On Fri, Feb 12, 2010 at 11:04 PM, don kretz wrote: > > Here's a point of reference for you. > > The current Encyclopædia Britannica project in F2 has been there since > > September. > > Number of pages: 232 > > Pages remaining: 24 > > Pages I've done: 203 > > Pages other people have done: 5 > > Some rounds get cherry-picked pretty badly; and OED is not a cherry. > > Stay away from buses. > > I might be willing to do a parallel F1 / merge, and automated markup > check for F2 skip if I don't have to find a PP in advance. > > Let me be blunt... I'm easily distracted; doing this kind of markup > would drive me nuts quickly and result in orphaned projects. I'll prep > the images, run OCR, answer questions, write the code to do the > automated checks. But I don't PP or format. > > -Bob > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From dakretz at gmail.com Sat Feb 13 10:32:28 2010 From: dakretz at gmail.com (don kretz) Date: Sat, 13 Feb 2010 10:32:28 -0800 Subject: [gutvol-d] Re: DP: was rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> Message-ID: <627d59b81002131032q629937aeo21ce4ca1cca02693@mail.gmail.com> You might want to work something out with these guys to keep track of your project logs after you're gone. > > Am 12.02.2010 um 20:45 schrieb Robert Cicconetti: > > > > Granted, once started, it will probably take decades to work through > DP... > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gbuchana at teksavvy.com Sat Feb 13 10:51:50 2010 From: gbuchana at teksavvy.com (Gardner Buchanan) Date: Sat, 13 Feb 2010 13:51:50 -0500 Subject: [gutvol-d] Many solo projects out there in gutvol-d land? Message-ID: <4B76F4C6.3030006@teksavvy.com> I've done a few books for PG. I've used DP -- back in the day, but mostly I've been doing solo projects. I don't hear a lot about folks doing projects solo these days. Are there many of us out there? ============================================================ Gardner Buchanan Ottawa, ON FreeBSD: Where you want to go. Today. From sly at victoria.tc.ca Sat Feb 13 11:49:14 2010 From: sly at victoria.tc.ca (Andrew Sly) Date: Sat, 13 Feb 2010 11:49:14 -0800 (PST) Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <4B76F4C6.3030006@teksavvy.com> References: <4B76F4C6.3030006@teksavvy.com> Message-ID: I suspect there are. You just don't see a lot of communication between them. I'm often checking newly posted texts for the catalog records, and I do notice credits sometimes that do not mention dp. I do projects on my own sometimes. I know Al Haines does many. I recall seeing a few religious texts lately from an individual contributor. --Andrew On Sat, 13 Feb 2010, Gardner Buchanan wrote: > I've done a few books for PG. I've used DP -- back in the day, > but mostly I've been doing solo projects. I don't hear a lot > about folks doing projects solo these days. Are there many of > us out there? > From ajhaines at shaw.ca Sat Feb 13 12:25:48 2010 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sat, 13 Feb 2010 12:25:48 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? References: <4B76F4C6.3030006@teksavvy.com> Message-ID: I've produced many books single-handed, from scan to post, for both PG-US and PG-Canada. As a Whitewasher, I've encountered maybe a dozen solo producers. Many of the first-timers I deal with aren't prepared for the work involved in producing a book, and abandon their projects. Very few become multi-project submitters. Abandoned projects are not lost. My practice is to wait a year, then decide if I want to do the book myself. If I do, and I can find a scanset, I get a clearance, and produce the book. Many of the early producers, who did books when etext numbers were less than about 5000, no longer produce. I can think of only a few who do. Gardner, you're one, and David Price and David Widger are others. 
I didn't start until the very early 10000's--my first book was #10750, released January 2004. Al ----- Original Message ----- From: "Gardner Buchanan" To: "Project Gutenberg Volunteer Discussion" Sent: Saturday, February 13, 2010 10:51 AM Subject: [gutvol-d] Many solo projects out there in gutvol-d land? > I've done a few books for PG. I've used DP -- back in the day, > but mostly I've been doing solo projects. I don't hear a lot > about folks doing projects solo these days. Are there many of > us out there? > > ============================================================ > Gardner Buchanan > Ottawa, ON FreeBSD: Where you want to go. Today. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From dakretz at gmail.com Sat Feb 13 12:53:05 2010 From: dakretz at gmail.com (don kretz) Date: Sat, 13 Feb 2010 12:53:05 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: References: <4B76F4C6.3030006@teksavvy.com> Message-ID: <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> Interesting. I hadn't realized the two organizations were so closely interdependent. So effectively, PG's release volume is almost directly dependent on DP's posting volume. And whatever validation requirements PG might have don't have much relevance if they differ from DP's requirements, as long as the WWers don't reject them. DP is the publisher, and PG is the distributor (roughly speaking). -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Sat Feb 13 20:47:08 2010 From: prosfilaes at gmail.com (David Starner) Date: Sat, 13 Feb 2010 23:47:08 -0500 Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank reports in) In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com> Message-ID: <6d99d1fd1002132047q3547ec29x8a82ba42efba6dd9@mail.gmail.com> On Sat, Feb 13, 2010 at 3:17 AM, Karl Eichwalder wrote: > Robert Cicconetti writes: > >> Not really. The OED uses a predecessor of IPA with some oddball >> symbols... at the least I have to come up with a table for those or >> they'll be all over the place. I started one, need to finish it. > > You could consider processing it at dp-canada or dp-int--both are UTF-8 > enabled. I have two problems with that. One, I'm not sure all the symbols are in Unicode. Two, just making Unicode available doesn't overcome the problems that these characters are not on any physical keyboards and only the most esoteric software keyboards. Even with Unicode available, if it were pure IPA, I'd go with SAMPA. -- Kie ekzistas vivo, ekzistas espero. From sly at victoria.tc.ca Sat Feb 13 23:20:25 2010 From: sly at victoria.tc.ca (Andrew Sly) Date: Sat, 13 Feb 2010 23:20:25 -0800 (PST) Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> Message-ID: On Sat, 13 Feb 2010, don kretz wrote: > Interesting. I hadn't realized the two organizations were so closely > interdependent. Well, yes.
There is a lot of interplay and adaptation between the two. But I would not say that either is dependent upon the other for its existence. If PG were to somehow disappear or close down, I'm sure that DP would continue, finding another repository for its finished texts--or creating one if needed. And if DP were to disappear, PG would go on just as it always has, only with a much lower volume of texts being posted. > So effectively, PG's release volume is almost directly dependent on DP's > posting volume. The majority of new PG texts for many years have come from DP, yes. For a quick comparison, I see that DP's 15,000th text was posted on May 12, 2009. They will have done many more since then, and have by now done more than half of the 31,000-odd items in PG. A while ago, I added this to the Wikipedia article on Project Gutenberg, to try to clarify what effect DP had had on it: "This effort greatly increased the number and variety of texts being added to Project Gutenberg, as well as making it easier for new volunteers to start contributing." I could go on describing the hows and wherefores of that in more detail, but this is getting too long already. > And whatever validation requirements PG might have don't have much relevance > if > they differ from DP's requirements, as long as the WWers don't reject them. Well, that has been part of the balancing act, if you will. PG has always adapted (albeit, sometimes slowly) according to its contributors. And DP contributors, after conversations back and forth, have helped to shape what direction PG is going in. One example that comes to mind is dropping the requirement that a text be of a certain length, in order to accommodate all the sci-fi short stories. In my own opinion, this can be difficult, because there are many parts that make up this process of DP-PG. Sometimes people make suggestions that seem good from their point of view, but very few seem to have an accurate over-all picture, to know how one action can affect other parts of the process. > DP is the publisher, and PG is the distributor (roughly speaking). I don't know if that metaphor fits perfectly. Project Gutenberg itself seems to fill more of the publisher's role, as well as distributor and archiver. DP does what might be compared to the traditional roles of type-setter, proofreader, fact-checker, etc. And don't underestimate the role of the post-processor. It still comes down to one person who has to do a lot of work on the text, and often make decisions about how to deal with many various things, before it is ready for submitting to PG. --Andrew From dakretz at gmail.com Sun Feb 14 00:09:44 2010 From: dakretz at gmail.com (don kretz) Date: Sun, 14 Feb 2010 00:09:44 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> Message-ID: <627d59b81002140009u4a42a463hbab6742d65c7b310@mail.gmail.com> Not to worry - the last thing any of us do is undervalue the post-processor. The job just seems to become more complex, and the amount of value-add they provide beyond what the rest of us do keeps increasing. I don't think anyone is particularly happy about that, least of all the PPers. They're the smallest piece of pipe everything has to fit through, and they aren't getting much help in the way of tool support. On Sat, Feb 13, 2010 at 11:20 PM, Andrew Sly wrote: > > And don't underestimate the role of the post-processor.
> It still comes down to one person who has to do a lot of work > on the text, and often make decisions about how to deal with > many various things, before it is ready for submitting to > PG. > > > --Andrew > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ke at gnu.franken.de Sun Feb 14 00:42:23 2010 From: ke at gnu.franken.de (Karl Eichwalder) Date: Sun, 14 Feb 2010 09:42:23 +0100 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: (Andrew Sly's message of "Sat, 13 Feb 2010 23:20:25 -0800 (PST)") References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> Message-ID: Andrew Sly writes: > And don't underestimate the role of the post-processor. > It still comes down to one person who has to do a lot of work > on the text, and often make decisions about how to deal with > many various things, before it is ready for submitting to > PG. I think we can change this. It would be much better to do this mysterious PP'ing in a collaborative manner. To experience this, I created an SVN repository and started with TEI tagging. I'll add more of the PGTEI framework soon: http://code.google.com/p/tieck-texts/ ATM, there is just one book and one contributor. More to come--thus far I did not announce it widely. pgdp seems to be down right now... -- Karl Eichwalder From traverso at posso.dm.unipi.it Sun Feb 14 01:02:52 2010 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Sun, 14 Feb 2010 10:02:52 +0100 (CET) Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank reports in) In-Reply-To: <6d99d1fd1002132047q3547ec29x8a82ba42efba6dd9@mail.gmail.com> (message from David Starner on Sat, 13 Feb 2010 23:47:08 -0500) References: <8005.73d837a3.38a06575@aol.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com> <6d99d1fd1002132047q3547ec29x8a82ba42efba6dd9@mail.gmail.com> Message-ID: <20100214090252.504D6FFB4@cardano.dm.unipi.it> >>>>> "David" == David Starner writes: David> On Sat, Feb 13, 2010 at 3:17 AM, Karl Eichwalder David> wrote: >> Robert Cicconetti writes: >> >>> Not really. The OED uses a predecessor of IPA with some >>> oddball symbols... at the least I have to come up with a table >>> for those or they'll be all over the place. I started one, >>> need to finish it. >> You could consider processing it at dp-canada or dp-int--both >> are UTF-8 enabled. David> I have two problems with that. One, I'm not sure all the David> symbols are in Unicode. This could be managed with replacements of the few (are they few?) missing characters. Two, just making Unicode available David> doesn't overcome the problems that these characters are not David> on any physical keyboards and only the most esoteric David> software keyboards. This could be managed with a character picker, like the Greek and hieroglyph popups in the proofing interface. Or some of the tools in some of Don's project comments. Even with Unicode available, if it were David> pure IPA, I'd go with SAMPA.
SAMPA might be OK for publication, and probably for entering too, but for checking (rounds after the first) it requires knowing the OED/SAMPA correspondence. Impossible, except for experts. One might however easily build converters from SAMPA and IPA to OED using the conversion software that is running at DP-EU (convert button in the standard interface). Undocumented, but I know it, and I have both the software and part at least of the conversion tables, and can build in minutes any further table needed. Probably it is something that might be experimented with at DP-EU: apparently Nikola is maintaining the converter, and adding a table to it is straightforward (it is an ASCII table). If you want, I can start a project there with a few pages. There is however another, worse problem: I am not sure that the OED is free from copyright in Canada or Serbia. Carlo
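The ASCII conversion tables Carlo mentions are easy to picture: two columns, a source sequence and its replacement, applied longest match first so multi-character sequences beat their prefixes. The tab-separated table format below is an assumption; DP-EU's actual converter surely differs in detail. A minimal sketch in Python:

    def load_table(path):
        """Read a two-column, tab-separated replacement table,
        skipping blank lines and # comments."""
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line or line.startswith("#"):
                    continue
                source, target = line.split("\t", 1)
                table[source] = target
        return table

    def convert(text, table):
        """Apply the table left to right, preferring the longest match."""
        keys = sorted(table, key=len, reverse=True)
        out, i = [], 0
        while i < len(text):
            for key in keys:
                if text.startswith(key, i):
                    out.append(table[key])
                    i += len(key)
                    break
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

Supporting a new target alphabet is then just a matter of writing another table, which is presumably why a further table can be built "in minutes".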
From Bowerbird at aol.com Sun Feb 14 09:17:05 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 14 Feb 2010 12:17:05 EST Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? Message-ID: <1c256.7508109.38a98a11@aol.com> karl said: > I think we can change this. It would be much better > to do this mysterious PP'ing in a collaborative manner. > To experience this, I created an SVN repository and > started with TEI tagging. I'll add more of > the PGTEI framework soon: that's right. to simplify the job of postprocessing, throw in a dose of s.v.n. and then add some t.e.i., and all the complexities will waft away on a breeze. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From donovan at abs.net Sun Feb 14 09:27:28 2010 From: donovan at abs.net (D Garcia) Date: Sun, 14 Feb 2010 12:27:28 -0500 Subject: [gutvol-d] DP Outage [WAS: Re: ... solo projects ...] In-Reply-To: References: <4B76F4C6.3030006@teksavvy.com> Message-ID: <201002141227.28318.donovan@abs.net> Karl Eichwalder wrote: > I did not announce it widely. pgdp seems to be down right now... The server is up, the network is down. Unfortunately, our colocation provider is one of many in the NJ/NYC region that has been affected by fiber cuts related to the underground transformer explosion in NYC. Both upstream providers are working at this time to put temporary solutions in place to restore connectivity to these facilities until permanent repairs can be made. We did just obtain an ETA of "a couple more hours" from them via our coloc contact, but that would appear at the moment to be a somewhat optimistic educated guess. Hopefully service will be restored by this evening (Sunday US EST). David (donovan) From dakretz at gmail.com Sun Feb 14 09:27:28 2010 From: dakretz at gmail.com (don kretz) Date: Sun, 14 Feb 2010 09:27:28 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <1c256.7508109.38a98a11@aol.com> References: <1c256.7508109.38a98a11@aol.com> Message-ID: <627d59b81002140927i43ab562bi4ce26ac3de6dcbe6@mail.gmail.com> Yes, it's down again. I'm not sure I see how this would fix anything. You still have the PPers at the crossroads with the (somewhat doubtful) requirement that they will accept the opportunity to become intimately familiar with XML. At DP, the cost of doing something is measured in volunteer inconvenience. As a consequence, change is not embraced with much enthusiasm, nor is measurement or personal responsibility, and responsibility tends to be provided by software coercion. On Sun, Feb 14, 2010 at 9:17 AM, wrote: > karl said: > > I think we can change this. It would be much better > > to do this mysterious PP'ing in a collaborative manner. > > To experience this, I created an SVN repository and > > started with TEI tagging. I'll add more of > > the PGTEI framework soon: > > that's right. to simplify the job of postprocessing, > throw in a dose of s.v.n. and then add some t.e.i., > and all the complexities will waft away on a breeze. > > -bowerbird > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sun Feb 14 09:46:35 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 14 Feb 2010 12:46:35 EST Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? Message-ID: <1cd29.10b0e871.38a990fb@aol.com> postprocessing at distributed proofreaders is difficult only because the proofers are instructed to throw out meaningful data, which later then needs to be replaced, and the formatters insert obtrusive pseudo-markup, much of which later needs to be reworked or deleted. if the proofers used nonobtrusive zen markup instead, it wouldn't interfere with their proofing task, and there wouldn't need to be a separate formatting task, even if (in reality) people decided to specialize on that aspect. also, the conversion of proofed/formatted pages into a full-on electronic-book should be an automatic process. i've already demonstrated this many times, but would be happy to do it once again, on any book of your choice... this point is particularly important in a roundless system, where the object is to move a page to a "finished" status as quickly as possible. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Sun Feb 14 09:58:22 2010 From: marcello at perathoner.de (Marcello Perathoner) Date: Sun, 14 Feb 2010 18:58:22 +0100 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <627d59b81002140927i43ab562bi4ce26ac3de6dcbe6@mail.gmail.com> References: <1c256.7508109.38a98a11@aol.com> <627d59b81002140927i43ab562bi4ce26ac3de6dcbe6@mail.gmail.com> Message-ID: <4B7839BE.3080206@perathoner.de> don kretz wrote: > At DP, the cost of doing something is measured in volunteer > inconvenience. As a consequence, change is not embraced with > much enthusiasm, nor is measurement or personal responsibility, > and responsibility tends to be provided by software coercion. That is very short-sighted. The inconvenience for the volunteer should be balanced against the usefulness for the reader. The mindset at PG and DP is that, everybody being volunteers, they don't have to account for the quality of their work. -- Marcello Perathoner webmaster at gutenberg.org From walter.van.holst at xs4all.nl Sun Feb 14 12:25:36 2010 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Sun, 14 Feb 2010 21:25:36 +0100 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> Message-ID: <4B785C40.5000304@xs4all.nl> On 2/14/10 8:20 AM, Andrew Sly wrote: > And don't underestimate the role of the post-processor. > It still comes down to one person who has to do a lot of work > on the text, and often make decisions about how to deal with > many various things, before it is ready for submitting to > PG.
What is it they actually do? Regards, Walter From dakretz at gmail.com Sun Feb 14 12:47:02 2010 From: dakretz at gmail.com (don kretz) Date: Sun, 14 Feb 2010 12:47:02 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <4B785C40.5000304@xs4all.nl> References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> Message-ID: <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> On Sun, Feb 14, 2010 at 12:25 PM, Walter van Holst < walter.van.holst at xs4all.nl> wrote: > On 2/14/10 8:20 AM, Andrew Sly wrote: > > And don't underestimate the role of the post-processor. >> It still comes down to one person who has to do a lot of work >> on the text, and often make decisions about how to deal with >> many various things, before it is ready for submitting to >> PG. >> > > What is it they actually do? > > Regards, > > Walter > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > That's a simple question with a complicated answer. Here is an explanation that is apparently as concise as anyone has been able to come up with. As you can see, it's some mixture of: a.) validating all the work done by about 6 Rounds of work on each page in the project; b.) running a bunch of other semi-manual checks on the project; c.) filling the gap caused by the fact that the text markup and layout produced by the Rounds isn't the same as the text format and layout required by PG; d.) producing a complete HTML version of the project based on the format and markup that was originally considered appropriate for the text-only version that was all that PG offered at the time it was designed. So you can see that it's by far the majority of the individual tasks required to produce an e-book (text and html), only a small few of which have been distributed. In some cases the PPer also reproofs the entire project. -------------- next part -------------- An HTML attachment was scrubbed... URL: From grythumn at gmail.com Sun Feb 14 21:49:05 2010 From: grythumn at gmail.com (Robert Cicconetti) Date: Mon, 15 Feb 2010 00:49:05 -0500 Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank reports in) In-Reply-To: <20100214090252.504D6FFB4@cardano.dm.unipi.it> References: <8005.73d837a3.38a06575@aol.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com> <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com> <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com> <6d99d1fd1002132047q3547ec29x8a82ba42efba6dd9@mail.gmail.com> <20100214090252.504D6FFB4@cardano.dm.unipi.it> Message-ID: <15cfa2a51002142149q79761674m4796b4870893aa27@mail.gmail.com> On Sun, Feb 14, 2010 at 4:02 AM, Carlo Traverso wrote: > David> I have two problems with that. One, I'm not sure all the > David> symbols are in Unicode. > > This could be managed with replacements of the few (are they few?) > missing characters. The OED phonetic alphabet, and an incomplete match to various Unicode symbols: http://home.comcast.net/~grythumn/oed/ > There is however another, worse problem: I am not sure that the OED is > free from copyright in Canada or Serbia. Better hope there is some sort of corporate work exception... there were several editors, dozens of subeditors, and hundreds of volunteer readers.
Not all of whom appear on the title page, but many are listed. -Bob From schultzk at uni-trier.de Mon Feb 15 01:00:32 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Mon, 15 Feb 2010 10:00:32 +0100 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> Message-ID: Hi All, Let me see if I understand this right. 6 Rounds of work is done just to be worked over so that text and HTML versions can be created and the final result is published. Why, in God's name, is it not done the other way around!! Get a clean text and HTML version and then add all the googly goop afterwards. Sure would save a lot of time. I know DP knows about markup, but have they ever heard of pseudo-code/markup? regards Keith. Am 14.02.2010 um 21:47 schrieb don kretz: > On Sun, Feb 14, 2010 at 12:25 PM, Walter van Holst wrote: > On 2/14/10 8:20 AM, Andrew Sly wrote: > > And don't underestimate the role of the post-processor. > It still comes down to one person who has to do a lot of work > on the text, and often make decisions about how to deal with > many various things, before it is ready for submitting to > PG. > > What is it they actually do? > > > That's a simple question with a complicated answer. > > > Here is an explanation that is apparently as concise as anyone has been able to come up with. > > > As you can see, it's some mixture of: > > a.) validating all the work done by about 6 Rounds of work on each page in the project; > b.) running a bunch of other semi-manual checks on the project; > c.) filling the gap caused by the fact that the text markup and layout produced by the Rounds isn't the same as the text format and layout required by PG; > d.) producing a complete HTML version of the project based on the format and markup that was originally considered appropriate for the text-only version that was all that PG offered at the time it was designed. > > So you can see that it's by far the majority of the individual tasks required to produce an e-book (text and html), only a small few of which have been distributed. > > In some cases the PPer also reproofs the entire project. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Mon Feb 15 09:43:01 2010 From: dakretz at gmail.com (don kretz) Date: Mon, 15 Feb 2010 09:43:01 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> Message-ID: <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> On Mon, Feb 15, 2010 at 1:00 AM, Keith J. Schultz wrote: > Hi All, > > Let me see if I understand this right. > > 6 Rounds of work is done just to be worked over so that text > and HTML versions can be created and the final result > is published. > > Why, in God's name, is it not done the other way around!! > Get a clean text and HTML version and then add all the > googly goop afterwards. Sure would save a lot of time. > > I know DP knows about markup, but have they ever > heard of pseudo-code/markup? > > regards > Keith.
> Am 14.02.2010 um 21:47 schrieb don kretz: > > You'd think it would be obvious, wouldn't you? When DP started, here was the basic process as far as the participants were concerned. 1.) A person takes a page of text and a picture of the text, plus a mediocre online text editor and some guidelines to follow, and tries to get the text to match the picture. 2.) A second person takes their work and the same picture and guidelines, and tries to make it better. 3.) The system strings the text files together and hands them off to PG to publish. Clean, simple, and most importantly it provides each person with the immediate and obvious positive gratification of seeing their work self-evidently closing the gap between the text and the picture. Now, almost all the process has been so completely decomposed and constrained that almost all the opportunity for gratification shows up for a little bit to the first proofer (who still must not do *too much* to make it look like the picture, i.e. format it); maybe the first formatter (if there's even much left to do); and supremely and finally, gloriously, the Post Processor (whose name is associated semi-eternally with their work). There's a whole lot more that can be said (and is said, in the DP forums, loudly, into the vastness of space) about how it got to be this way, and how happy people are about it, and what might be done. These are not dumb people, even though the work seems to have become dumb work. But there's the picture in a nutshell. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Feb 15 10:13:17 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 15 Feb 2010 13:13:17 EST Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? Message-ID: why do you guys insist on hijacking threads? it's rude. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Feb 15 10:32:51 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 15 Feb 2010 13:32:51 EST Subject: [gutvol-d] Re: When DP started, here was the basic process Message-ID: see how easy it is to change the subject-header? *** don said: > When DP started, here was the basic process the irony was, back in those olden days, it was actually much more difficult to digitize a text, because the o.c.r. was horrific, and thus it was a pure pain to proof. nowadays, even though o.c.r. is vastly improved, it seems to take forever for a book to transit d.p. here's an illustrative datapoint i just churned... in one of the books that rfrank is using for his roundless experiment, even tepid preprocessing (which is what he practices) combined with o.c.r. produced 20% of the book's 240 pages perfectly. another 30% of the pages had only 1 error on 'em. and most of the errors failed spellcheck, meaning they could've been isolated and fixed immediately, without need of a word-by-word proofing modality. d.p. uses dozens of volunteers, taking hours of time, to do something that one person can do in one hour. which would, you know, ordinarily be a very sad thing. except what makes it funny, in this particular case, is that the people at d.p. think they're being "efficient"... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL:
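The triage bowerbird describes is mechanically simple: run each OCR page through a wordlist check, count the failures, and bucket the pages so the perfect and one-error pages never enter word-by-word proofing. The wordlist file and the bucket thresholds below are assumptions, and a wordlist check is of course cruder than a real spellchecker; a minimal sketch in Python:

    import re
    from collections import Counter

    def load_wordlist(path):
        """One known-good word per line, lowercased."""
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def suspects(page_text, words):
        """Count tokens on a page that fail the wordlist check."""
        tokens = re.findall(r"[A-Za-z']+", page_text)
        return sum(1 for t in tokens if t.lower() not in words)

    def triage(pages, words):
        """Bucket pages: clean pages skip proofing, one-error pages can be
        fixed on the spot, and the remainder go to a proofing round."""
        buckets = Counter()
        for text in pages:
            n = suspects(text, words)
            buckets["clean" if n == 0 else "one error" if n == 1 else "proof"] += 1
        return buckets

On the 240-page book described above, a pass like this would put roughly 20% of the pages in the first bucket and another 30% in the second.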
From sly at victoria.tc.ca Mon Feb 15 11:11:03 2010 From: sly at victoria.tc.ca (Andrew Sly) Date: Mon, 15 Feb 2010 11:11:03 -0800 (PST) Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> Message-ID: On Mon, 15 Feb 2010, don kretz wrote: > When DP started, here was the basic process as far as the > participants were concerned. > > 1.) A person takes a page of text and a picture of the text, > plus a mediocre online text editor and some guidelines to > follow, and tries to get the text to match the picture. > > 2.) A second person takes their work and the same picture > and guidelines, and tries to make it better. > > 3.) The system strings the text files together and hands > them off to PG to publish. Are you sure you have phrased that in the way you wanted? At no point in the history of DP was the output of the rounds "strung together and handed directly off to PG". I cannot recall if the name of "post-processor" has always been used--but there has always been someone in that role. Anyone who has worked on PP would know that the output on the rounds at DP is _not_ ready to be posted as a finished text without a good deal more work. But this is ok--this is as intended. The purpose of DP (as I understand it) has always been to distribute much of the work, and make things easier for the person preparing the text for submission to PG. To put this in context, let's compare with pre-DP times, when everything was done on an individual basis. An easy text that has come through DP can be prepared and submitted in one day; a more difficult one can take a week or two; a really hard one might take months working on it on and off. Now take those same texts without the DP preparation, where an individual starts working himself from the OCR output. The easy text could take perhaps three to six weeks; the more difficult one five to eight months or longer; and the hardest texts that have been done through DP could never have been attempted by an individual. One other very significant aspect is that DP has been set up to encourage a sense of community. And you have ready access to people with specialized knowledge about many languages, musical notation, obscure Unicode characters, obsolete typesetting conventions, etc. In the time before DP it was quite common for someone to put much effort into working on a text, and then burn out and abandon the project. Having DP gives many people a chance to do their bit, and have a much more manageable learning curve. --Andrew From traverso at posso.dm.unipi.it Mon Feb 15 11:27:29 2010 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Mon, 15 Feb 2010 20:27:29 +0100 (CET) Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: (message from Andrew Sly on Mon, 15 Feb 2010 11:11:03 -0800 (PST)) References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> Message-ID: <20100215192729.AFB1DFFB5@cardano.dm.unipi.it> >>>>> "Andrew" == Andrew Sly writes: Andrew> Are you sure you have phrased that in the way you wanted? Andrew> At no point in the history of DP was the output of the Andrew> rounds "strung together and handed directly off to PG". I Andrew> cannot recall if the name of "post-processor" has always Andrew> been used--but there has always been someone in that role. When I started at DP, in 2002, the work needed to pass from the R2 output to posting to PG was officially estimated at 30 minutes, without any specialized tool. I think that "strung together and handed directly off to PG" is a correct metaphor for 30 minutes of work. Enough to remove the separators, reflow the line ends, and that was all. No formatting (italics converted to uppercase for ship names), accents removed, no spell-checking, no gutcheck. This was a task of the project manager, and handing the task to somebody else was exceptional. Of course, even then, it took me much longer to complete a book, since I used to re-read the book to catch a bunch of remaining errors. Carlo From dakretz at gmail.com Mon Feb 15 12:20:30 2010 From: dakretz at gmail.com (don kretz) Date: Mon, 15 Feb 2010 12:20:30 -0800 Subject: [gutvol-d] Re: When DP started, here was the basic process In-Reply-To: References: Message-ID: <627d59b81002151220j6816e805m5806a2424011125@mail.gmail.com> Very good. So it should now be obvious to you why the thread was not hijacked; it was providing valuable background for roger's thinly-disguised "experiment". -------------- next part -------------- An HTML attachment was scrubbed... URL: From klofstrom at gmail.com Mon Feb 15 12:47:16 2010 From: klofstrom at gmail.com (Karen Lofstrom) Date: Mon, 15 Feb 2010 10:47:16 -1000 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> Message-ID: <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com> On Mon, Feb 15, 2010 at 7:43 AM, don kretz wrote: Re the two-round system: > Clean, simple, and most importantly it provides each person > with the immediate and obvious positive gratification of > seeing their work self-evidently closing the gap between > the text and the picture. Yes, and it often produced godawful results. If the R2 proofreader was sloppy, a sloppy text went to the PPer. Some PPers exhausted themselves reproofing the text to fix the mistakes that R2 had left. Others just processed the text and sent it off to PG, warts and all. One R2 proofer had proofed an astonishing number of pages ... but he did so by smoothreading them hurriedly, without checking against the image. He missed many errors. PPers complained. Readers of PG texts complained. The current workflow at DP is a *reaction* to the previous lack of quality control. That's why P3ers have to pass a test. That's why proofing and formatting were separated.
OK, our quality control is strangling us. I don't think the answer is to go back to the good old days of two rounds and error-ridden texts. -- Karen Lofstrom From dakretz at gmail.com Mon Feb 15 13:15:26 2010 From: dakretz at gmail.com (don kretz) Date: Mon, 15 Feb 2010 13:15:26 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com> References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com> Message-ID: <627d59b81002151315k552be78bw4019e0d65f64da1f@mail.gmail.com> Nor is anyone suggesting going back. I was describing the progression and how it has affected the relationship between the users and the work. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Mon Feb 15 13:26:09 2010 From: dakretz at gmail.com (don kretz) Date: Mon, 15 Feb 2010 13:26:09 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <627d59b81002151315k552be78bw4019e0d65f64da1f@mail.gmail.com> References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com> <627d59b81002151315k552be78bw4019e0d65f64da1f@mail.gmail.com> Message-ID: <627d59b81002151326x54098df5x7abd558241729844@mail.gmail.com> There have at each step been a number of alternatives for dealing with quality issues. We (or someone, it was hardly "we") made choices which had consequences. One of the consequences was improved quality. Another was a change in the user's work experience (always a greater constraint, notice, seldom if ever improved user tools.) We are where we are. We can, I suppose, say "it was done the best way possible, and what we have is the inevitable cost of the improvements." I think that's a difficult position to defend. Which is exactly what roger is, intentionally or not, making quite clear. We can't recast the decisions made in the past, but we need to do a better job of learning from them and doing better. Sooner would be nicer than later. Hence rfrank's project. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajhaines at shaw.ca Mon Feb 15 14:03:16 2010 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Mon, 15 Feb 2010 14:03:16 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com> Message-ID: <9F120957CF48439F9C63FD74DE1B25F7@alp2400> As a Whitewasher who's dealt with old DP productions as well as new ones, over the last couple of years, I second (and third and fourth) everything Karen says. Others may hold DP's current system to be inefficient/slow/etc., but it does one thing that makes it all worthwhile--it can produce error-free texts.
Example: I'm currently dealing with an errata report for an old DP production. I haven't looked into the problem in detail yet, but from what I've seen, at least several pages are missing, followed by a repeat of material that precedes the missing material. I'm going to have to go through the problem area of the posted text, compare it to a scanset, figure out which material is missing/redundant, OCR and proof whatever's missing, knit it into the text, then run Gutcheck/Jeebies/Gutspell on the repaired text, which will undoubtedly unearth a raft of other errors, all followed by a reformat and a repost. Also undoubtedly, many other errors will remain. Is it worth it? Personally speaking, no. It's going to take hours to fix this text, time that I'd far rather spend on my own productions, but there's currently no mechanism except for the Whitewashers, a.k.a. Errata Team, to fix this kind of thing. (Probably simpler to just re-do this text from scratch, which is something *I'm* not about to do.) In short, DP's current processes produce error-free texts; its old processes, from what I've seen of the results, didn't. Al ----- Original Message ----- From: "Karen Lofstrom" To: "Project Gutenberg Volunteer Discussion" Sent: Monday, February 15, 2010 12:47 PM Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? > On Mon, Feb 15, 2010 at 7:43 AM, don kretz wrote: > > Re the two-round system: > >> Clean, simple, and most importantly it provides each person >> with the immediate and obvious positive gratification of >> seeing their work self-evidently closing the gap between >> the text and the picture. > > Yes, and it often produced godawful results. If the R2 proofreader was > sloppy, a sloppy text went to the PPer. Some PPers exhausted > themselves reproofing the text to fix the mistakes that R2 had left. > Others just processed the text and sent it off to PG, warts and all. > > One R2 proofer had proofed an astonishing number of pages ... but he > did so by smoothreading them hurriedly, without checking against the > image. He missed many errors. > > PPers complained. Readers of PG texts complained. The current workflow > at DP is a *reaction* to the previous lack of quality control. That's > why P3ers have to pass a test. That's why proofing and formatting were > separated. OK, our quality control is strangling us. I don't think the > answer is to go back to the good old days of two rounds and > error-ridden texts. > > -- > Karen Lofstrom > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From dakretz at gmail.com Mon Feb 15 14:12:20 2010 From: dakretz at gmail.com (don kretz) Date: Mon, 15 Feb 2010 14:12:20 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <9F120957CF48439F9C63FD74DE1B25F7@alp2400> References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com> <9F120957CF48439F9C63FD74DE1B25F7@alp2400> Message-ID: <627d59b81002151412j40dba25bm9718d1c20670c9a@mail.gmail.com> I can't think of anyone I know who would argue otherwise. That's not an issue that's open for discussion, I don't think.
On Mon, Feb 15, 2010 at 2:03 PM, Al Haines (shaw) wrote:

> As a Whitewasher who's dealt with old DP productions as well as new
> ones, over the last couple of years, I second (and third and fourth)
> everything Karen says.
>
> Others may hold DP's current system to be inefficient/slow/etc., but
> it does one thing that makes it all worthwhile--it can produce
> error-free texts.
From Bowerbird at aol.com Mon Feb 15 15:51:53 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 15 Feb 2010 18:51:53 EST
Subject: [gutvol-d] Re: When DP started, here was the basic process
Message-ID: <17845.277aae89.38ab3819@aol.com>

don said:
> So it should now be obvious to you
> why the thread was not hijacked

um, no. gardner's thread was most certainly hijacked. he was looking for other solo producers, and someone who might be interested in that topic now has to plow through a bunch of posts that talk about something completely different...

-bowerbird

From Bowerbird at aol.com Mon Feb 15 16:01:30 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 15 Feb 2010 19:01:30 EST
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <17cc8.47818a6d.38ab3a5a@aol.com>

andrew said:
> An easy text that has come through DP
> can be prepared and submitted in one day;
> a more difficult one can take a week or two;
> a really hard one might take months
> working on it on and off.
>
> Now take those same texts without the DP preparation, where
> an individual starts working himself from the ocr output.
> The easy text could take perhaps three to six weeks;
> the more difficult one five to eight months or longer;
> and the hardest texts that have been done through DP
> could never have been attempted by an individual.

you just made up all of those numbers. take an easy text, the kind that can be "prepared and submitted in one day" after having gone through d.p., but which -- according to you -- "could take perhaps three to six weeks" were it to be done by a solo person. your figures are just ridiculous... it takes an hour, perhaps two or three, to spellcheck a typical easy book and get it formatted into shape... for a more difficult book, the spellchecking time is dwarfed by the formatting task, which is not really significantly lessened by having gone through d.p.

and no text is so difficult that it "could never have been attempted by an individual", so that's just balderdash... there might not be any individuals who _are_ motivated to take on big projects, but given the rate at which these big projects get finished at d.p., the gap isn't all that big.

> One other very significant aspect is that DP
> has been set up to encourage a sense of community.

i'm not sure charlz "set up" d.p. for that specific purpose. it's true that a sense of community _has_ developed there. but that can happen just about anywhere. it's also the case that the d.p. community indulges itself often in groupthink, which is one down side of "community". i'm not arguing that the down side offsets the good, because i don't think it does, but if we are going to mention one side, let's mention both...

> And you have ready access to people with
> specialized knowledge about many languages,
> musical notation, obscure unicode characters,
> obsolete typesetting conventions, etc.

that's true. but it's also the case that that "ready access" _could_ have developed right here, on this listserve, and been available to everyone, including the "solo" producers. so having it exist only within the d.p. silo is a bit regrettable.

> At no point in the history of DP was

well, carlo has already pointed out that andrew's memory is a bit foggy on this particular point. and that happens with individuals as we grow older, so there's no shame in that... but there's a tendency among d.p.
people to rewrite history, almost always in a way that's favorable to their interpretation, so it's always refreshing when that tendency gets a fact-check.

-bowerbird

From Bowerbird at aol.com Mon Feb 15 16:31:33 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 15 Feb 2010 19:31:33 EST
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <18a2b.33166dc4.38ab4165@aol.com>

al said:
> In short, DP's current processes produce error-free texts;
> its old processes, from what I've seen of the results, didn't.

oh my, this is just too rich. a full-on admission that an "old" d.p. e-text was full of bugs. too rich. because d.p. cheerleaders -- like karen "zora" lofstrom -- have _always_ maintained that d.p. output was super-clean; it's the stuff from the _individual_ producers that is shoddy, _not_ the material from d.p. you can see this same attitude expressed _to_this_very_day_ over on the d.p. forum boards.

of course, it was _easy_ to prove them wrong in the old days; all you had to do was make a laundry-list of errors in a text. (i provided such a list of errors to this very listserve for a book that was postprocessed by zora herself -- #13603 -- and the _hundreds_ of errors i located have _still_ not been repaired, even though the book was posted way back in october of 2004. so much for zora's stance of superiority. her work is flawed.)

anyway, after enough laundry-lists of errors had been made, d.p. people finally had to admit their quality-control was faulty. sadly, they didn't know how to fix their system, so they just piled on more rounds, and built a flawed "certification system" to promote some proofers to "final-round" status, which only had the effect of stagnating their workflow with huge queues, as a boatload of books (thousands!) plugged up the system... and they've clung to this hierarchical model in the face of clear evidence (from their own experiments!) that _proved_ that the p3 proofers aren't any better than the p1 proofers...

and even though it's perfectly clear that you can get good pages without subjecting every page in every book to 3 proofing rounds and 2 formatting rounds, followed by postprocessing, and then maybe smoothreading, and then maybe postprocessing verification, nonetheless that's what their workflow system calls for them to do. so they experiment with ways to circumvent that workflow system, instead of just fixing it. it is a comedy of errors, in slow-motion...

but hey, as long as you get "error-free texts", then who cares if you're wasting tons of time and energy donated by the volunteers?

-bowerbird

From sly at victoria.tc.ca Mon Feb 15 22:11:28 2010
From: sly at victoria.tc.ca (Andrew Sly)
Date: Mon, 15 Feb 2010 22:11:28 -0800 (PST)
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID:

On Mon, 15 Feb 2010, Carlo Traverso wrote:

> >>>>> "Andrew" == Andrew Sly writes:
>
> Andrew> At no point in the history of DP was the output of the
> Andrew> rounds "strung together and handed directly off to PG". I
> Andrew> cannot recall if the name of "post-processor" has always
> Andrew> been used--but there has always been someone in that role.
>
> When I started at DP, in 2002, the work needed to pass from the R2
> output to posting to PG was officially estimated in 30 minutes,
> without any specialized tool. I think that "strung together and handed
> directly off to PG" is a correct metaphor for 30 minutes of work.
> Enough to remove the separators, reflow the line ends, and that was
> all. No formatting (italics converted to uppercase for ship names),
> accents removed, no spell-checking, no gutcheck. This was a task of
> the project manager, and handing the task to somebody else was
> exceptional.

Thanks Carlo. Perhaps my memory has become hazy in the intervening years. :)

But still I question your list. Why were accents removed? It was fairly routine to post latin-1 texts at that time. (I can find an "8-bit" text as #1595, with a release date of Jan, 1999.) The earliest reference to gutcheck that I can find in my old emails is on Tue, 23 Jul 2002, but I don't think it was in common use yet. It was actually something that Jim T. had written as an evaluation tool for submitted texts.

> Of course, even then, it took me much longer to complete a book,
> since I used to re-read the book to catch a bunch of remaining errors.

I did the same with the project I ran through DP at that time as well. Perhaps that's why I assumed it was the norm.

--Andrew

From traverso at posso.dm.unipi.it Tue Feb 16 00:44:00 2010
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Tue, 16 Feb 2010 09:44:00 +0100 (CET)
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <20100216084400.933ACFFCA@cardano.dm.unipi.it>

>>>>> "Andrew" == Andrew Sly writes:

Andrew> On Mon, 15 Feb 2010, Carlo Traverso wrote:

Andrew> Thanks Carlo. Perhaps my memory has become hazy in the
Andrew> intervening years. :)

Andrew> But still I question your list. Why were accents removed? It
Andrew> was fairly routine to post latin-1 texts at that time. (I
Andrew> can find an "8-bit" text as #1595, with a release date of
Andrew> Jan, 1999.)

These were the DP guidelines (copied from the PG official guidelines). I remember Ultima Thule, a book on Iceland, with a discussion of what to do with the eths in names (they were eventually replaced with th) while the accents were routinely dropped. The book eventually was redone from scratch; it might have been the last one before DP changed officially to preserving accents.
Carlo

From walter.van.holst at xs4all.nl Tue Feb 16 01:12:42 2010
From: walter.van.holst at xs4all.nl (Walter van Holst)
Date: Tue, 16 Feb 2010 10:12:42 +0100
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID:

On Mon, 15 Feb 2010 13:26:09 -0800, don kretz wrote:

> or not, making quite clear. We can't recast the decisions made in the
> past, but we need to do a better job of learning from them and doing
> better. Sooner would be nicer than later. Hence rfrank's project.

In that vein, how flexible is the DP software? I've been wondering to what extent parallel P1 rounds might be helpful. I find P2 proofing exceedingly boring because of the small number of errors that are left to be fixed in texts that are well-scanned and well-proofed in P1. I can't imagine how mind-numbing P3 will be if I ever become eligible for that 'status'. I can imagine that only having to look at the differences between redundant P1 proofed texts might be helpful since it would take two independent P1 proofers to overlook the same error to have it slip through.

Another potential improvement might be to make texts available to the next round on a per page basis instead of having to wait for all pages to be finished in the previous round.

Aforementioned suggestions may be silly, feel free to point out their silliness.

Regards,

Walter

From traverso at posso.dm.unipi.it Tue Feb 16 02:30:30 2010
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Tue, 16 Feb 2010 11:30:30 +0100 (CET)
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <20100216103030.B767FFFCA@cardano.dm.unipi.it>

>>>>> "Walter" == Walter van Holst writes:

Walter> On Mon, 15 Feb 2010 13:26:09 -0800, don kretz wrote:

>> or not, making quite clear. We can't recast the decisions made
>> in the past, but we need to do a better job of learning from
>> them and doing better. Sooner would be nicer than later. Hence
>> rfrank's project.

Walter> In that vein, how flexible is the DP software? I've been
Walter> wondering to what extent parallel P1 rounds might be
Walter> helpful. I find P2 proofing exceedingly boring because of
Walter> the small number of errors that are left to be fixed in
Walter> texts that are well-scanned and well-proofed in P1. I
Walter> can't imagine how mind-numbing P3 will be if I ever become
Walter> eligible for that 'status'.
Walter> I can imagine that only having to look at the differences
Walter> between redundant P1 proofed texts might be helpful since
Walter> it would take two independent P1 proofers to overlook the
Walter> same error to have it slip through.

This would be simple enough, just allowing a PM to load a set of txt files and a dummy proofer name in one of the project's columns. The administrators (having DB access) do this if asked, I suppose with a script (I have one in the test site). Another improvement would be to allow a PM to skip a round; this too is reserved to the few, overloaded administrators, but it is just changing a flag at one point in the code.

Walter> Another potential improvement might be to make texts
Walter> available to the next round on a per page basis instead of
Walter> having to wait for all pages to be finished in the
Walter> previous round.

This might be trickier, since the whole philosophy of DP code is based on rounds and per-round permissions. It would require at least starting a new test DP site in which new changes in the code are made and extensively experimented with in a live environment. The current test site is used for testing features that are potentially disruptive, and is inadequate for live testing: it is for alpha testing; a beta testing site would be necessary, or probably more than one.

rfrank's test site at fadedpage has abandoned the round philosophy, but is not derived from DP code; it is reimplemented from scratch.

Walter> Aforementioned suggestions may be silly, feel free to
Walter> point out their silliness.

Not silly at all; I believe that the main problem of DP is its rigidity, the "one size fits all" philosophy, that is partly in the code, but mostly in the procedures, and is necessary in a huge structure. Smaller DP sites like DP-EU and DP-CAN have shown a more flexible structure, so I believe that a confederation of different DP sites, sharing a common aim and a common codebase, but different local laws and software configurations, and a loose coordination, would be a better model.

Carlo

From dakretz at gmail.com Tue Feb 16 07:43:50 2010
From: dakretz at gmail.com (don kretz)
Date: Tue, 16 Feb 2010 07:43:50 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <627d59b81002160743x6eb4788dld0875f9995a936fc@mail.gmail.com>

I'm biting my tongue, Carlo. The difficulties aren't primarily with the code, which can be (and on occasion has been) amended to overcome those types of problems. However, none of our volunteers has considered it appropriate or within the scope of their skills or interests to, for instance, document it; so it's pretty closely held within a small group.

On Tue, Feb 16, 2010 at 2:30 AM, Carlo Traverso wrote:

> >>>>> "Walter" == Walter van Holst writes:
>
> Walter> On Mon, 15 Feb 2010 13:26:09 -0800, don kretz wrote:
>
> >> or not, making quite clear. We can't recast the decisions made
> >> in the past, but we need to do a better job of learning from
> >> them and doing better.
> >> Sooner would be nicer than later. Hence rfrank's project.
>
> Carlo

From Bowerbird at aol.com Tue Feb 16 13:47:33 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 16 Feb 2010 16:47:33 EST
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <15691.224faab8.38ac6c75@aol.com>

walter said:
> In that vein, how flexible is the DP software?

it depends. in general it's not all that flexible... but sometimes a little creativity goes a long way. even more inflexible than the software, however, is the willingness of its coders to change anything. and even when there is someone like dakretz who is willing to roll up his sleeves and do some work, the administrators won't let him.
so there you go.

> I've been wondering to what extent
> parallel P1 rounds might be helpful. ...
> Aforementioned suggestions may be silly,
> feel free to point out their silliness.

it's _good_ to "wonder" about things, walter. it means your mind is working on a solution. so that part isn't silly at all. and the part about parallel p1 rounds is not silly either. to the contrary, it's a good idea; might not _work_, but it's still a good _idea_. so that's not silly at all.

what _is_ silly, however, is that -- in spite of the fact that people have had this good idea for a very long time now -- d.p. has _never_ actually _tested_ it directly to see if it works. oh, they've run some research, and tried out some things, but they've never actually done a full-on _experiment_ to test the hypothesis. so, for years and years a parallel-proof idea has been around, but we're still "wondering" whether it might work or not. _that_ is silly...

for the record, once again, i've reassembled some data from various d.p. "experiments" (i'm using the term extremely loosely here) and i've even written the software that helps you reconcile two iterations of parallel proofs, so i can give you some conclusions on all that, namely that it doesn't give you better accuracy, and thus it certainly doesn't outweigh the cost of doing the reconciliation (which is rather high, even given a good tool), so i don't recommend it. however, a focused experiment on this matter would be good, so as to validate my findings...

having said all that, though, there's a "variant" on parallel proofing that you might find interesting... taking o.c.r. results from 2 different sources and comparing them to find their differences and then resolving those differences and calling it "finished" _does_ happen to be an extremely effective strategy, since it avoids all the word-by-word proofing rounds. i documented all this on a thread on the d.p. forums, entitled "a revolutionary methodology for proofing", or something to that effect... you could look it up...

> I find P2 proofing exceedingly boring
> because of the small number of errors
> that are left to be fixed in texts that are
> well-scanned and well-proofed in P1.

well, there's a lot that could be said about this, walter. perhaps first and foremost is that proofing _is_ boring. especially a word-by-word proofing on an accurate text.

> I can't imagine how mind-numbing P3 will be
> if I ever become eligible for that 'status'.

since most of the o.c.r. errors are gone by the time of p3, most p3 proofers have resorted to trying to find errors in the book itself, errors that the publisher/typesetter made. this lets them leave a comment, so they can do something. for instance, in the book that i'm now examining which rfrank used in his "roundless" experiment, there were 50 comments left in a 240-page book, or 20% of the pages. of course, addressing all these comments is a task that is done by the postprocessor, which is one of many reasons why that job has become more taxing in the current era...

> I can imagine that only having to look at the differences
> between redundant P1 proofed texts might be helpful
> since it would take two independent P1 proofers
> to overlook the same error to have it slip through.

well, yes, and that's the main argument for parallel proofing. but it ends up that yes, indeed, "two independent p1 proofers" often _do_ "overlook the same error" and it then slips through.
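a minimal sketch of that reconciliation step, in plain perl -- this is not bowerbird's tool, and the file names p1a.txt and p1b.txt are hypothetical; it just prints every line on which two independently proofed copies of the same book disagree, on the theory that lines where both proofers agree are very likely correct:

#!/usr/bin/perl
# compare two independently proofed versions of the same book,
# line by line, and print only the places where they disagree;
# lines on which both proofers agree are skipped.
use strict;
use warnings;

open my $fa, '<', 'p1a.txt' or die "p1a.txt: $!";
open my $fb, '<', 'p1b.txt' or die "p1b.txt: $!";
my $lineno = 0;
while (defined(my $la = <$fa>)) {
    my $lb = <$fb>;
    $lineno++;
    last unless defined $lb;     # second version ran short
    chomp($la, $lb);
    next if $la eq $lb;          # the proofers agree -- skip
    print "line $lineno:\n  A: $la\n  B: $lb\n";
}
close $fa;
close $fb;

note the big assumption built into the lockstep read: the two versions must still be line-for-line parallel. the moment a proofer joins or splits a line, a real alignment (or an ordinary diff) is needed instead.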
and in the same manner, sometimes an independent p1 and p2 proofer "overlook the same error" and it then slips through to p3. now it would be great if we had some solid _data_ on the numbers, so we could decide how much energy we want to spend on catching these errors that slip through. we've found that _some_ errors can go as many as 7 or 8 or 9 rounds without being caught, but no one is suggesting we spend that many proofing rounds on every page... so we have to decide over how many rounds we will expend our energy, in order to catch what percentage of errors. it's really that simple. and to make that decision, it would be great if we had some data. and it's silly -- ridiculous! -- that we have not collected that data.

> Another potential improvement might be to make
> texts available to the next round on a per page basis
> instead of having to wait for all pages to be finished
> in the previous round.

well, now you're suggesting a "roundless" system, walter. which is also not a silly suggestion. unfortunately, it's not a _new_ suggestion either, so you're not advancing the art. what you _are_ doing is showing we have no data on _this_ particular wrinkle either, even though it's a very old idea... and again, this failure to collect data and test hypotheses is extremely silly, especially since we debate matters endlessly. like clara peller bellowing "where's the beef?", we should now make it a community slogan to demand "where's your data?"

meanwhile, i keep myself busy by collecting what data i can, and writing the software tools that we need to do these jobs. and i talk and talk, but most people here are too busy being silly to listen to me. which i find to be endlessly amusing. :+)

-bowerbird

From Bowerbird at aol.com Tue Feb 16 15:34:03 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 16 Feb 2010 18:34:03 EST
Subject: [gutvol-d] roundlessness -- 004
Message-ID: <18e16.1867d154.38ac856b@aol.com>

we're looking at rfrank's "roundless" experiment at fadedpage.com...

***

i'm going over the data for one of the books rfrank used in his test, and once again the results i observe are striking and unequivocal... the proofers made hundreds of changes in this 240-page book, but most of 'em could've been detected and fixed during preprocessing, which would've made the workflow both smoother and more efficient.

sure, there is the occasional stealth scanno -- "array" for "army", and "riot" for "not" -- which (one could argue) would seem to require the word-by-word proofing that is expected at distributed proofreaders. but they are few and far between, and in almost all cases innocuous. and certainly one round of such close proofing will be all that would be needed if the obvious-and-easy-to-automatically-detect errors were found and fixed in preprocessing. once these obvious glitches have been fixed, the proofer is essentially doing _smooth-reading_... this is of the utmost importance if you really want (as rfrank claims) to have each page be "one and out" (i.e., be finished by one proofer). otherwise once is simply not enough, not for a good many pages...

it's also the case that -- with the right tool -- doing preprocessing is fun and exhilarating. it's really a kick in the pants to be able to improve a book so quickly and efficiently, and move it to "the finish". compared to the boring nature of proofing, there is no comparison...
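for what it's worth, here is a tiny illustration in plain perl of the kind of "obvious and easy to detect automatically" fixing being described -- the three rules are examples only, not a recommended list, and any real rule-set would be tuned to the book at hand:

#!/usr/bin/perl
# a few illustrative preprocessing fixes; each rule reports how
# many substitutions it made, so you can see what the proofers
# would otherwise have had to fix by hand. periods are left out
# of the first rule on purpose, so spaced ellipses survive.
use strict;
use warnings;

my %rules = (
    'space before punctuation' => sub { $_[0] =~ s/ +([,;:?!])/$1/g },
    'doubled spaces'           => sub { $_[0] =~ s/(?<=\S)  +(?=\S)/ /g },
    '"tlie" scanno'            => sub { $_[0] =~ s/\btlie\b/the/g },
);

local $/;                      # slurp the whole text at once
my $text = <>;
for my $name (sort keys %rules) {
    # s///g returns the substitution count; @_ aliases $text,
    # so each rule edits the book in place
    my $n = $rules{$name}->($text) || 0;
    print STDERR "$name: $n fix(es)\n";
}
print $text;

a run like perl preprocess.pl raw.txt > cleaned.txt (both file names hypothetical) also yields a per-rule count on stderr, which is data in itself: a rule that never fires can be dropped, and a rule that fires constantly deserves a second look for false alarms.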
i have demonstrated this same finding on book after book after book, with no exceptions, so i am quite confident that it is extremely robust. all you have to do is look for it, and i assure you that you will find it... i wonder why so many of you are so resistant to learning the truth...

-bowerbird

From Bowerbird at aol.com Wed Feb 17 10:43:58 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 17 Feb 2010 13:43:58 EST
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID:

yeah, i know, a call for actual _data_. what a bummer, man. ruins _all_ your fun, and brings the dialog to an abrupt stop.

-bowerbird

From ajhaines at shaw.ca Wed Feb 17 11:20:48 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Wed, 17 Feb 2010 11:20:48 -0800
Subject: [gutvol-d] "The Inheritance" by Susan Edmondstone Ferrier
Message-ID:

If anyone's looking for a project, look no further than the above. Internet Archive has assorted editions, none of which are projects in DP. All editions appear to be clearable under Rule 1 (pre-1923).

Hmmm... maybe bowerbird is up for them?

Al

From Bowerbird at aol.com Wed Feb 17 12:41:20 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 17 Feb 2010 15:41:20 EST
Subject: [gutvol-d] Re: "The Inheritance" by Susan Edmondstone Ferrier
Message-ID: <12985.490903c3.38adae70@aol.com>

al said:
> Hmmm... maybe bowerbird is up for them?

what's my motivation, al? the one edition i looked at -- in english, i speak no german -- is a simple book, quite straightforward, so wouldn't prove anything. i mean, i'm totally willing to run through the exercise-wheel, but what do i get when i come out the other side?

how about this, a win-win-win-win situation for everyone... for a while, the p.g. website has been directing newbies over to distributed proofreaders if they want to help out with the cause. and sure enough, d.p. gets a ton of volunteers as a result of that. unfortunately, d.p. doesn't appreciate the newbies, because they just contribute more stuff to the plethora of p1-proofed backlog. so how about, if i were to do this book for you, p.g. would start sending all the new volunteers to rfrank's roundless site instead? d.p. happy, rfrank happy, the bird happy, al happy, and p.g. happy.

so, do we have a deal?

-bowerbird

From ajhaines at shaw.ca Wed Feb 17 14:50:07 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Wed, 17 Feb 2010 14:50:07 -0800
Subject: [gutvol-d] Re: "The Inheritance" by Susan Edmondstone Ferrier
Message-ID: <97FD5D5CD0E846AD94214B14886737BA@alp2400>

I'm not aware of any general PG practice to send newcomers to DP.

Your motivation? What do I care? Be altruistic, and do a book.

If you want something challenging, PG's Preprints page (http://preprints.readingroo.ms/) has lots of candidates.

----- Original Message -----
From: Bowerbird at aol.com
To: gutvol-d at lists.pglaf.org ; bowerbird at aol.com
Sent: Wednesday, February 17, 2010 12:41 PM
Subject: [gutvol-d] Re: "The Inheritance" by Susan Edmondstone Ferrier

al said:
> Hmmm... maybe bowerbird is up for them?

what's my motivation, al? the one edition i looked at -- in english, i speak no german -- is a simple book, quite straightforward, so wouldn't prove anything.
From Bowerbird at aol.com Wed Feb 17 15:55:59 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 17 Feb 2010 18:55:59 EST
Subject: [gutvol-d] Re: "The Inheritance" by Susan Edmondstone Ferrier
Message-ID: <1929b.3e85d22f.38addc0f@aol.com>

al said:
> I'm not aware of any general PG practice to send newcomers to DP.

i'm not surprised to learn that you're not paying any attention, al.

> Your motivation?

yeah, you know, the reason why i would spend time and energy on this.

> What do I care?

why did you suggest i do this book?

> Be altruistic, and do a book.

i'm altruistic just by being here, sharing my analyses with people, al. i'm altruistic when i _do_ the research that _leads_ to those analyses. i'm altruistic when i design and program tools that do what's required, because that proves that such tools can indeed be designed and coded, and it lets me be precise and experienced when i assess their value... i'm altruistic when i work up my suggestions for an improved workflow. i'm altruistic in a number of ways, al. to "do a book" seems unnecessary, unless you're intentionally suggesting that i should lower my sights a lot.

i clean text for the fun of it, al, just like my girlfriend does her sudoku. so i don't need a _lot_ of motivation... but i certainly do need _some_... you've given me no good reason to work on the book you've suggested, al, so i'm not sure why you even bothered to mention my name at all...

> If you want something challenging

in the exact same way that _i_ will decide how i will be altruistic (or even _if_ i will be altruistic), i will also be the one who decides what is "challenging" to me. but, like, thanks for your suggestion, and have a nice day, ok?

-bowerbird

From gbuchana at teksavvy.com Wed Feb 17 20:21:57 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Wed, 17 Feb 2010 23:21:57 -0500
Subject: [gutvol-d] Re: Bowerbird's software projects
Message-ID: <4B7CC065.6030004@teksavvy.com>

On 17-Feb-2010 18:55, Bowerbird at aol.com wrote:
> i'm altruistic when i design and program tools that do what's required,
> because that proves that such tools can indeed be designed and coded,

Where was that Sourceforge project again? I know you've talked about tools that do more/better checking than Gutcheck and have automated fixing and such.
I would like to try them out. Where can I get my hands on this stuff?

============================================================
Gardner Buchanan Ottawa, ON
FreeBSD: Where you want to go. Today.

From Bowerbird at aol.com Thu Feb 18 10:47:26 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Feb 2010 13:47:26 EST
Subject: [gutvol-d] Re: Bowerbird's software projects
Message-ID:

gardner said:
> Where was that Sourceforge project again?

there is no sourceforge project. my source code has never been open-source. my enemies will need to write their own code.

> I know you've talked about tools that
> do more/better checking than Gutcheck

actually, i don't think i've ever compared my stuff with any other software directly, because any tool is better than no tool. gutcheck has some charms. my tools do different things, and do things differently, but whether they do "more" or "better" is not an issue.

> and have automated fixing and such.

some have, yes. it's also important to remember that i do lots of experimentation, with quick-and-dirty code that serves to test the usefulness of a particular feature, but which might never be implemented further, perhaps because it doesn't prove to be worthy, or because the generalized code would take more time than i can give, or simply because that task just hasn't been done yet...

> I would like to try them out.
> Where can I get my hands on this stuff?

i'll be happy to send you a copy, gardner, since you are an independent producer -- that's my target sweetspot. until i release the program generally, which might be very soon but also might not, you'll have to agree not to distribute the app any further, since i want to know who has it so that i can engage them in dialog about it, but that's the only restriction at this point.

you'll also need to tell me what version you want -- mac or p.c. or linux. your signature-block screams out linux, which is fine, but you'd be one of my first linux users, so if you want the more-well-tested windows version, say so.

finally, please give a short description -- frontchannel -- of your _current_ workflow. how do you do your books? do you use an editor, or some other tool? use gutcheck? what kind of preprocessing do you do on the raw o.c.r.? if you need to view a scan to check the text on some page, how do you do that? how do you find errors, with reg-ex?, or via a word-by-word proof of every page? anything else?

if anyone else wants to get a copy of my program, say so, either frontchannel or back. the same conditions apply...

also, if you're interested, you should check out don's app:
> http://code.google.com/p/dp50/downloads/list
his tool is similar in many ways, and you might like it too.

-bowerbird

From joey at joeysmith.com Thu Feb 18 14:56:25 2010
From: joey at joeysmith.com (Joey Smith)
Date: Thu, 18 Feb 2010 15:56:25 -0700
Subject: [gutvol-d] Re: Bowerbird's software projects
Message-ID: <20100218225625.GA29062@joeysmith.com>

On Thu, Feb 18, 2010 at 01:47:26PM -0500, Bowerbird at aol.com wrote:
> if anyone else wants to get a copy of my program, say so,
> either frontchannel or back. the same conditions apply...

I believe I've already said so, for Linux, at least twice. Each time I was told I'd have to join some "Yahoo!" listserv, which is too far to go for an unproven piece of software... I generally try pretty hard to keep my information out of the clutches of Yahoo!
From Bowerbird at aol.com Thu Feb 18 16:20:03 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Feb 2010 19:20:03 EST
Subject: [gutvol-d] roundlessness -- 005
Message-ID: <18218.6bab9c25.38af3333@aol.com>

we're looking at rfrank's "roundless" experiment at fadedpage.com...

***

i've been looking at the data for one of the books rfrank used... the book is titled "eagles of the sky", and it runs about 240 pages. in total, there were about 500 lines that were changed by proofers. here's a list of roughly 300 of those lines:
> http://z-m-l.com/go/frabf/frabf300diffs.html
the list shows the original line, from o.c.r., and the edited line. there's also a link to each page-scan, if you want to view that...

i could've included all 500 changed lines in this list, except that the 200 lines i've excluded contained errors that _should_have_ (most definitely) been fixed in preprocessing, and it burns me up to be a witness to such a tremendous waste of proofer resources. it's just a crime against the generous contribution of the proofers to put such shoddy text in front of them and expect them to fix it.

roger frank may think i'm talking shit about him again, but i tell you, i'll talk shit about _any_ producer who gives shoddy text to proofers, and, on top of that, i will feel _justified_ and _moral_ about doing it. i mean, it's not exactly a _mystery_ how to do good preprocessing. i have spent a lot of time writing posts here documenting _exactly_ how to do good preprocessing, and i did a heckuva lot of research to learn and test those preprocessing methods to prove their utility. so when someone just _ignores_ what i've done, and continues to burn out proofers by wasting their time and energy with material that should've had the obvious-and-easy-to-detect-automatically errors fixed before it went in front of 'em, i have a right to be mad. someone needs to talk some sense into roger's head, and do it fast.

not to say that roger is the only one, or the worst one. not by far. i don't even bother looking at what the p.g. content producers are giving to their proofers these days, to protect my blood pressure. but since the odds that anything has changed over there are none and none, someone should talk some sense into _their_ heads too. it's a _shame_ to be wasting proofer time, and you producers who fail to do good preprocessing should be _ashamed_ of yourselves.

***

at any rate, if that list of 300 changes is too dense for your brain, you can also thumb through the pages and see the changes made:
> http://z-m-l.com/go/frabf/frabfp123.html
all changed lines are marked in red (o.c.r.) and blue (edits), so if any particular page (like page 123, linked to above) is all-in-black, it means that none of the lines on that page had any changes made. you will find this "stepping through the edits" to be very friendly...

-bowerbird

From gbuchana at teksavvy.com Thu Feb 18 16:37:14 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Thu, 18 Feb 2010 19:37:14 -0500
Subject: [gutvol-d] Re: Bowerbird's software projects
Message-ID: <4B7DDD3A.6070801@teksavvy.com>

On 18-Feb-2010 13:47, Bowerbird at aol.com wrote:
> actually, i don't think i've ever compared my stuff
> with any other software directly, because any tool

Perhaps not, but over time you have described checks that your tools can do and fixes that you can automatically make that sound a little to me like a super-duper gutcheck.
Also the workflow I picture is a little like gutcheck -- I am thinking of text-in text-out command line tools, not something that needs to look at image scans or makes me talk to it in a fancy U/I. This is perhaps an inaccurate impression I have. If the comparison is totally inappropriate, I'm sorry.

> you'll also need to tell me what version you want -- mac
> or p.c. or linux. your signature-block screams out linux,

Probably something that would run in FreeBSD would be most useful -- a Linux build would, I think. Windows would be fine too.

> finally, please give a short description -- frontchannel --
> of your _current_ workflow. how do you do your books?

This is still fairly accurate: http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_Voices#Gardner_Buchanan ...although I have a nicer flatbed scanner now.

I used to always page-by-page scan, OCR and first-proof books that I was doing from physical copies. The last couple of books I've done instead by scanning, bulk OCR and then proof from the scans and raw OCR text, which I can do on the road with my laptop or anywhere I can mount a USB key for a couple of hours.

After OCR I have a few basic things that I do via regular expressions in vi: I find and fix spaced punctuation, find and fix M-dashes. If there are any obvious consistent scannos -- the Heavysege item I just finished had Ys that looked to Finereader more like Vs, for example -- I will have a crack at finding those. I have been known to write a one-off perl script to get at something that bugs me enough.

The thing is that I do not have a specific set of checks and fixes that I consistently do. I rely a lot on jeebies and gutcheck. I would like something perhaps with a wider range of things that it can find so I don't have to know all the things to look for. Over the years you have mentioned several automated checks and fixes that sounded sensible enough to me. I'm not keen enough to go back through the archives, find them and implement them -- but I am nevertheless interested in trying a tool like this out on a project to see if it adds value for what I do.

Heck, you can grab http://www.gutenberg.org/dirs/3/1/2/1/31212/31212-8.txt and just tell me what you find. I have no doubt there is lots to find.

============================================================
Gardner Buchanan Ottawa, ON
FreeBSD: Where you want to go. Today.

From Bowerbird at aol.com Thu Feb 18 18:21:13 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Feb 2010 21:21:13 EST
Subject: [gutvol-d] [SPAM] re: Re: Bowerbird's software projects
Message-ID: <1b289.d2043e8.38af4f99@aol.com>

gardner said:
> Perhaps not, but over time you have described
> checks that your tools can do and
> fixes that you can automatically make
> that sound a little to me like a super-duper gutcheck.

yes, except that those checks and fixes are most often programmed only into one-off versions of programs... it's usually the case that making those checks and fixes useful in the general case, against any random book, is a more difficult matter. this isn't an apology of any sort; it's just that my intentions (for the most part) are to show that a particular check can be accomplished, and is useful. so far i haven't concentrated on building them into my app, because nobody's really expressed much interest in the app. the app has a general spellcheck ability, and that captures a very high percentage of the errors that occur within a text.
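that wordlist approach is easy to sketch in plain perl; the following is a rough illustration, not the app's actual code, and it assumes a plain-text dictionary such as the usual unix /usr/share/dict/words (any wordlist file would do). it flags every word the list does not know, most frequent first, so the rare genuine errors bubble up for eyeballing:

#!/usr/bin/perl
# flag every word in a text that is absent from a reference
# wordlist, most-frequent first; what survives is usually a
# short list of names, archaisms, and real o.c.r. errors.
use strict;
use warnings;

my %dict;
open my $d, '<', '/usr/share/dict/words' or die "wordlist: $!";
while (<$d>) { chomp; $dict{lc $_} = 1 }
close $d;

my %unknown;
while (<>) {
    for my $w (/[A-Za-z']+/g) {
        (my $bare = lc $w) =~ s/^'+|'+$//g;   # trim stray quotes
        next if $bare eq '' or $dict{$bare};
        $unknown{$w}++;
    }
}
print "$unknown{$_}\t$_\n"
    for sort { $unknown{$b} <=> $unknown{$a} } keys %unknown;

run as perl unknowns.pl book.txt (names hypothetical); possessives and contractions will show up unless the wordlist carries them, which is exactly the false-alarm tuning problem discussed just below.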
> Also the workflow I picture is a little like gutcheck --
> I am thinking of text-in text-out command line tools,
> not something that needs to look at image scans or
> makes me talk to it in a fancy U/I.

i'm not sure what you mean by the workflow you "picture". i was asking about your _actual_ workflow, the one whereby you currently digitize books. are you saying that you now do your digitizations without ever looking at image scans? because i have a hard time imagining how you can do that.

you should also know i am a mac person, for good reason. for me, the interface is prime. if you're looking for tools that work on a command-line, in a text-in-text-out way, i'm the wrong tree for you to be barking up, that's for sure. i certainly wouldn't call my interface "fancy". to the contrary, it's extremely utilitarian, and not very pretty, not pretty at all. but it _is_ an interface, with buttons and menus and all that nice stuff that makes the program a lot easier to work with...

> a Linux build would, I think. Windows would be fine too.

i'll send you both.

> This is still fairly accurate:

ok, that was very useful... my tool assumes that the page-scans are in the same folder as the app, which is easy enough to satisfy. the tool also assumes that your text is all in one file, and that the page-boundary is of a certain type. i'd assume that your vi skills will enable you to satisfy this assumption in a fairly simple manner. other than that, i'd say you'll be good to go.

> The last couple of books I've done instead by scanning,
> bulk OCR and then proof from the scans and raw OCR text
> which I can do on the road with my laptop or
> anywhere I can mount a USB key for a couple of hours.

that's how you'll want to operate with my software, yes.

> After OCR I have a few basic things that I do
> via regular expressions in vi:

you can continue to do those things in vi if you like. global changes in vi are much quicker than going through one-by-one changes in the interface.

> The thing is that I do not have
> a specific set of checks and fixes that I consistently do.

that's something you'll want to remedy. i did a series here a couple years back where i collected a list of checks that was necessary for the book i tested, and somebody turned that list into a set of reg-ex tests. you can find that set on the download page for don's app:
> http://code.google.com/p/dp50/downloads/list
indeed, since you are already using reg-ex, you'll probably find that you prefer don's tool over mine, since his program lets you actually _build_in_ your own list of reg-ex checks...

> I rely a lot on jeebies and gutcheck.

so, when you get a report from them on the possible errors, you enter vi and use search to locate each one of the errors?

> I would like something perhaps with
> a wider range of things that it can find
> so I don't have to know all the things to look for.

well, yes. and you can find some very extensive lists of reg-ex checks, right on d.p. the problem is that many of those checks have a low signal-to-noise ratio, in that they create far too many false-alarms. this is a problem even with gutcheck and heebe-jeebe, if i'm not mistaken. so you really have to fine-tune your list of checks to the particular corpus on which you are working, to be useful. this is why don's app is so useful, because you can build in the list of checks you want to do, and modify it at will, and even enter in a specific reg-ex to see if it returns any hits.
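that fine-tuning is easy to support with a one-screen script; as a sketch (the checks.txt format -- one regex per line -- is invented for this example), the following counts how many lines of a book each check flags, which makes the signal-to-noise of a candidate list visible at a glance:

#!/usr/bin/perl
# run every regex in a checks file against a book and report
# how many lines each one flags; a check that fires on half
# the book is probably noise, one that fires twice may well
# be signal.
use strict;
use warnings;

my ($checkfile, $bookfile) = @ARGV;
die "usage: $0 checks.txt book.txt\n" unless $bookfile;

open my $c, '<', $checkfile or die "$checkfile: $!";
chomp(my @checks = grep { /\S/ } <$c>);
close $c;

open my $b, '<', $bookfile or die "$bookfile: $!";
my @lines = <$b>;
close $b;

for my $re (@checks) {
    my $hits = grep { /$re/ } @lines;
    printf "%6d  %s\n", $hits, $re;
}

a natural next step would be printing the matching lines with a little context for whichever checks look promising, so each one can be judged on real hits rather than on its pattern alone.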
> Over the years you have mentioned
> several automated checks and fixes
> that sounded sensible enough to me.

sounds like you really want to use that reg-ex list that was based on the month-long series that i did.

> I'm not keen enough to go back through the archives,
> find them and implement them -- but I am
> nevertheless interested in trying a tool like this out
> on a project to see if it adds value for what I do.

having heard all this, i'd guess don's app is the one for you.
> http://code.google.com/p/dp50/downloads/list
i'll send mine to you too, but his is based on reg-ex checks...

> http://www.gutenberg.org/dirs/3/1/2/1/31212/31212-8.txt
> and just tell me what you find. I have no doubt there is lots

if the scans are online too, or can be, i'll certainly take a look at it... without looking at them, i can't know if something is an error or not.

-bowerbird

From gbuchana at teksavvy.com Thu Feb 18 19:39:01 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Thu, 18 Feb 2010 22:39:01 -0500
Subject: [gutvol-d] Re: Bowerbird's software projects
Message-ID: <4B7E07D5.1070500@teksavvy.com>

On 18-Feb-2010 21:21, Bowerbird at aol.com wrote:
> it's usually the case that making those checks and fixes
> useful in the general case, against any random book, is
> a more difficult matter.

That's kind of my experience, I guess. Several fixes will suggest themselves, in the context of a given specific text. The next one might need different fixes. But that doesn't mean a long list of fixups can't be tried, when there's no cost to just adding tests/fixes to the list.

> for me, the interface is prime. if you're looking for tools
> that work on a command-line, in a text-in-text-out way,
> i'm the wrong tree for you to be barking up, that's for sure.

I see.

> i'll send you both.

Better stick to Windoze, if it's a GUI.

> ok, that was very useful... my tool assumes that the page-scans
> are in the same folder as the app, which is easy enough to satisfy.
>
> the tool also assumes that your text is all in one file, and that the
> page-boundary is of a certain type. i'd assume that your vi skills
> will enable you to satisfy this assumption in a fairly simple manner.
>
> other than that, i'd say you'll be good to go.

Text in one file -- check. I favour marking page boundaries with "===00123" these days, but a global search/replace can fix that.

> i did a series here a couple years back where i collected
> a list of checks that was necessary for the book i tested,
> and somebody turned that list into a set of reg-ex tests.
>
> you can find that set on the download page for don's app:
> > http://code.google.com/p/dp50/downloads/list

Yes. Looking at that. I am not 100% sure I want to mess with Twister exactly, but the list of regular expressions looks interesting. I'm picturing building a perl script that applies all of these fixes, then creates a patch set based on the differences it has introduced. I could then edit the patch set as a file, nuking changes that are wrong, and finally apply the patches for the changes I like.

> > I rely a lot on jeebies and gutcheck.
>
> so, when you get a report from them on the possible errors,
> you enter vi and use search to locate each one of the errors?

Kind of. Jeebies and gutcheck reference specific line numbers. So I go through the output of these bottom up.
For each hit I go to the specified line number and see what's up, fix if needed and then move to the previous hit. I work bottom to top so that changes I make don't invalidate the line numbers in the gutcheck output as I go. I find it takes a good couple of passes before I am satisfied I have all the genuine hits covered. Invariably the WW finds things I've missed anyhow.

> sounds like you really want to use that reg-ex list
> that was based on the month-long series that i did.

Yeah. Got those. Like I say -- I will turn it into a perl script and see where that takes me.

> i'll send mine to you too, but his is based on reg-ex checks...

Would be great. Thanks.

> > http://www.gutenberg.org/dirs/3/1/2/1/31212/31212-8.txt
> > and just tell me what you find. I have no doubt there is lots
>
> if the scans are online too, or can be, i'll certainly take a look at it...

Lots of choices there.

http://www.canadiana.org/ECO/ItemRecord/48293?id=16c79d4f15394e51
http://www.archive.org/details/advocateanovel00heavgoog
http://books.google.com/books?id=ot4OAAAAYAAJ&oe=UTF-8

There are no page numbers in the Gutenberg text though.

See you,

============================================================
Gardner Buchanan Ottawa, ON
FreeBSD: Where you want to go. Today.

From dakretz at gmail.com Thu Feb 18 21:18:30 2010
From: dakretz at gmail.com (don kretz)
Date: Thu, 18 Feb 2010 21:18:30 -0800
Subject: [gutvol-d] Re: Bowerbird's software projects
Message-ID: <627d59b81002182118s4fd527a6q4e4bdb47d649dc4f@mail.gmail.com>

For what it's worth, Twister comes out of pretty much your approach, Gardner. I worked for a long time from regexes in vi and am writing Twister to make as much of it as "batchy" as I can. For instance, when you load a regex file, you can click a button to get a count of each of the regexes in your list. I'm currently adding the ability to choose a regex and get a list of occurrences with 3 lines of context. The goal is to make it transparent how it works, and let you adjust it to make it work the way you do. (But no guarantees - it's still buggy. :( You might want to wait two or three days for a newer version.)

(I sure don't miss the requirement in vi to add all those backslashes you don't need in any other regex context I know of...)

From marcello at perathoner.de Thu Feb 18 23:18:31 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Fri, 19 Feb 2010 08:18:31 +0100
Subject: [gutvol-d] Re: Bowerbird's software projects
Message-ID: <4B7E3B47.10408@perathoner.de>

Gardner Buchanan wrote:
> On 18-Feb-2010 21:21, Bowerbird at aol.com wrote:
>
>> it's usually the case that making those checks and fixes
>> useful in the general case, against any random book, is
>> a more difficult matter.

The tragic bb in a nutshell. He gets one easy text, then builds a 'program' that finds the bugs in that one easy text and proclaims it the ultimate fixing tool. ... Everybody laughs. ... BB waits one year. ... Repetitur.

To build a useful tool you have to:
get two random samples of scans, say two sets of 100 complete book scans, using different scan techniques and different OCR on books of different ages and provenience. You could get those out of google or IA. 2. build a bug list of those OCRed texts against proven good copies. 3. build a program using the texts and error lists of the first group. You are not allowed to look at the second group texts. 4. run the program against your blind group and record the percentage of positives and negatives it finds. 5. run any known tools against the blind group and see if yours performs significantly better. 6. If better then brag else shut up. > That's kind of my experience, I guess. Several fixes will > suggest themselves, in the context of a given specific text. > The next one might need different fixes. But that doesn't mean > a long list of fixups can't be tried when there's no cost > to just adding tests/fixes to the list. If you have to enter the regexes manually you should use any editor that supports them. -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Fri Feb 19 01:35:34 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 19 Feb 2010 04:35:34 EST Subject: [gutvol-d] Re: Bowerbird's software projects Message-ID: <4f.5dec8603.38afb566@aol.com> gardner said: > That's kind of my experience, I guess. > Several fixes will suggest themselves, > in the context of a given specific text. > The next one might need different fixes. right. a rule i always follow is that when i find an error, i always search the rest of the text for other occurrences. > But that doesn't mean a long list of fixups > can't be tried when there's no cost > to just adding tests/fixes to the list. well, except for what i mentioned about false alarms. it's a rare test that doesn't turn up any false alarms, but when one turns up too many of them, it becomes a liability instead of an asset. the question is always, "how many false alarms is too many?", and the next question is always, "how can i weed out false alarms?" > Better stick to Windoze, if it's a GUI. actually, they are both generated from the same code, so they should act identically. whether they really do... > Text in one file -- check. > I favour marking page boundaries with "===00123" > these days, but a global search/replace can fix that. my app is looking for separator lines that look like this: {{myantp123.png}} || the_runhead || (note that there is a space in the first column.) that .png filename there is the name of the page-scan, and the program assumes you name your files wisely... so, for instance, if you want to jump to a certain page, you simply type the page number and press enter, and the program automatically jumps to that page. nifty... > Yes. Looking at that. I am not 100% sure > I want to mess with Twister exactly, but > the list of regular expressions looks interesting. > I'm picturing building a perl script that > applies all of these fixes, then creates a patch set > based on the differences it has introduced. > I could then edit the patch set as a file, > nuking changes that are wrong, and > finally apply the patches for the changes I like. i would be very surprised if you can make that workflow more efficient than simply editing text in the interface... the beauty of my app, and twister too, is that you can view the page-scan to help you make the edit decision.
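and the page-jump lookup behind that is trivial, by the way. here's a rough python sketch of it -- the separator format is the one shown above, but the function name is made up, not from any shipping tool:

    import re

    # a separator line looks like: " {{myantp123.png}} || the_runhead ||"
    # -- leading space, scan filename in braces, running head between pipes.
    SEPARATOR = re.compile(r'^ \{\{(.+?\.png)\}\} \|\| (.*?) \|\|')

    def page_jump(zml_lines, page):
        """return the index of the separator line for a given page number,
        plus the scan filename to display alongside the text."""
        want = "p%03d.png" % page            # page 123 -> "p123.png"
        for i, line in enumerate(zml_lines):
            m = SEPARATOR.match(line)
            if m and m.group(1).endswith(want):
                return i, m.group(1)
        return None                          # no such page in this book

so typing "123" and pressing enter is just page_jump(lines, 123), then scrolling the text to that line and loading the named scan next to it.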
i'm well aware that you don't _need_ to view the scan in order to resolve the vast majority of questions, but the inefficiency in handling that thin minority is huge if the bureaucracy of viewing the scan is too convoluted. > Jeebies and gutcheck reference specific line numbers. try twister. seriously. the ability to jump right to the page where the question occurs, and view the scan in context, is a major boost to efficiency. i bet you will be surprised... > I find it takes a good couple of passes before > I am satisfied I have all the genuine hits covered. > Invariably the WW finds things I've missed anyhow. that's a sign of an inefficient workflow. you want to accomplish things in one pass, and you want to make sure you got all of it. > Got those. Like I say -- I will turn it into > a perl script and see where that takes me. a perl script is operating blind. get a seeing-eye dog. > Lots of choices there. > http://www.canadiana.org/ECO/ItemRecord/48293?id=16c79d4f15394e51 > http://www.archive.org/details/advocateanovel00heavgoog > http://books.google.com/books?id=ot4OAAAAYAAJ&oe=UTF-8 none of those options are all that useful to me, however... what i need is to have the individual scans available online, each of them individually addressed with their own address. for instance, that "myantp123.png" file i referenced above, the one that reflects the scan of page 123 in "my antonia"? you can find that right here, in sequence with all the rest: > http://z-m-l.com/go/myant/myantp123.png this is the way the library of the future will be organized... if you want your work in it, mount your files appropriately. yes, i can download the .zip file of the scans from archive.org, or pull 'em from the google .pdf, and then mount them myself, but that's too much work for me to do, when you could have mounted them correctly in the first place. > There are no page numbers in the Gutenberg text though. then you threw away some very crucial information, didn't you? probably rewrapped the text too, am i right? and dehyphenated? all these actions make any kind of reproofing an impossible task. which is not to say that your proofing work was a waste of time. no, in such situations, i'll download the o.c.r. from archive.org, which _does_ still contain pagebreak info, and unwrapped text, and end-line hyphenates. and then i will use your proofed text to make the corrections to the archive.org o.c.r. and then i will throw your text away, and keep the corrected, unwrapped and page-marked text with the original end-line hyphenates in it... and when i throw away your text, i throw away your credit-line. had you kept all that valuable information which i need to have, instead of tossing it out, i probably would keep your credit-line. you know, just so you know... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Feb 19 12:47:09 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 19 Feb 2010 15:47:09 EST Subject: [gutvol-d] banana cream for mac and windows is available Message-ID: <10a64.5e5dc575.38b052cd@aol.com> banana cream for mac and windows is available. the linux version seems to have a plug-in conflict, which i do not feel like debugging right now, so linux people can use the windows version, or wait. (note: i also have a version for the classic mac o.s.) if you want to be a tester for this nice little app, let me know and i'll tell you how you can get it...
this is the version from the fall of 2008, which is the last time i worked on it in earnest, and i can't remember what kind of state i left it in, so there might well be some rough unfinished edges in it; but for the most part, it should be pretty smooth. like i said, though, it ain't pretty. ain't pretty at all. ;+) -bowerbird p.s. if people want to use it, i'd be happy to work on it again, to include incorporating all of those reg-ex tests that gardner was requesting. but... as only gardner and joey (yes joey, i did hear you) have asked for a copy so far, that seems unlikely... -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Feb 19 13:01:10 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 19 Feb 2010 16:01:10 EST Subject: [gutvol-d] banana cream Message-ID: <1115b.4e4ecfd9.38b05616@aol.com> hey guys- here's where you can get a windows version of banana cream: > http://z-m-l.com/go/bananacream/banana-cream2008.exe you'll also want to download this sample text: > http://z-m-l.com/go/bananacream/tjbus.zml put these two files in a folder of their own, and then rename the program to "tjbus.exe"... (by default, it loads the .zml file with the same name.) the program will then download the scans for that book ("the jungle", by upton sinclair) automatically from my site. (you can control the download method under one of the menus.) if you have any questions, let me know. i'd prefer to have the chance to address any complaints before you go public, but i have no desire for you to muzzle your truth however you see it. as stated earlier, i'd prefer you not distribute the app. thanks. -bowerbird p.s. the app follows my naming conventions, which require a 5-letter prefix at the start of each filename (e.g., "tjbus"), followed by a single letter declaring the type of page (either "c" for a cover, or "f" for forward matter, or "p" for a page), followed by the page-number (padded out to three places). if the page was unnumbered, you use the last page number, and append an "a", "b", "c", "d", etc., respectively, as needed. it's also the case that the .zml file needs a certain structure. i'll write something up for that, and send it along to you later. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Feb 19 13:10:54 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 19 Feb 2010 16:10:54 EST Subject: [gutvol-d] oops Message-ID: <115f7.768c9536.38b0585e@aol.com> crap. i meant to send that only to gardner and joey, not to the list. i don't care who gets the app, but i _would_ like to know who. so if you download it, please send me a backchannel saying so. (and if you download it more than once, let me know that too.) if there are unaccounted downloads, i'll have to delete the file, because i don't want too many copies out in the wild right now, beings that this is not meant as a finished copy in any respect... and since the cat is out of the bag, here's the mac version: > http://z-m-l.com/go/bananacream/bc2008f.app.zip -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Feb 19 16:20:54 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 19 Feb 2010 19:20:54 EST Subject: [gutvol-d] ok, let's take a look at gardner's book, just for the exercise Message-ID: <16399.189b0d8d.38b084e6@aol.com> i said: > no, in such situations, i'll download the o.c.r. 
from archive.org, > which _does_ still contain pagebreak info, and unwrapped text, > and end-line hyphenates. and then i will use your proofed text > to make the corrections to the archive.org o.c.r. and then i will > throw your text away, and keep the corrected, unwrapped and > page-marked text with the original end-line hyphenates in it... it occurs to me that it would be quite instructive to demo this exercise. i'll be using gardner's book to show how i'd go through this process. to prep, i downloaded the scans for his book from canadiana.org, and mounted them on my website, along with a skeleton copy of the text. here's a sample url: > http://z-m-l.com/go/gardn/gardnp123.html as you can see, the prefix for this book is "gardn", so if you put a copy of the banana-cream program in a folder of its own, and name it "gardn.exe", it will download the .zml file and the scans from the website, and you'll be able to see how to start to use it. -bowerbird p.s. mac users should name the app "gardn.app", of course... (or, since your .app extensions are likely hidden, just "gardn".) -------------- next part -------------- An HTML attachment was scrubbed... URL: From gbuchana at teksavvy.com Fri Feb 19 18:33:13 2010 From: gbuchana at teksavvy.com (Gardner Buchanan) Date: Fri, 19 Feb 2010 21:33:13 -0500 Subject: [gutvol-d] Re: Bowerbird's software projects In-Reply-To: <4f.5dec8603.38afb566@aol.com> References: <4f.5dec8603.38afb566@aol.com> Message-ID: <4B7F49E9.4090705@teksavvy.com> On 19-Feb-2010 04:35, Bowerbird at aol.com wrote: > > that's a sign of an inefficient workflow. > I don't believe that a single pass is feasible, in particular for mismatched quotes and spaced quotes. You fix the open quote, or in my case close more often than not, then that reveals another quote problem further down/up. In any event I am not troubled by multiple passes. > probably rewrapped the text too, am i right? and dehyphenated? Well it *is* a Gutenberg text after all. > which is not to say that your proofing work was a waste of time. > Thanks. > > and when i throw away your text, i throw away your credit-line. > Sure. The book *is* public domain after all. Do what you like. ============================================================ Gardner Buchanan Ottawa, ON FreeBSD: Where you want to go. Today. From Bowerbird at aol.com Fri Feb 19 20:57:09 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 19 Feb 2010 23:57:09 EST Subject: [gutvol-d] Re: Bowerbird's software projects Message-ID: <1bbf9.4e21db05.38b0c5a5@aol.com> gardner said: > http://unixcomputer.net:81/new-photo/cd/Advocate/ i scraped them from canadiana.org myself... :+) but hey, do you still have your original o.c.r.? or a latest version that has the original linebreaks intact? *** > I don't believe that a single pass is feasible, ok, i should elaborate. multiple passes, to check different aspects, will be required. but multiple passes to check the _same_ aspect are inefficient. > I don't believe that a single pass is feasible, > in particular for mismatched quotes and spaced quotes. i believe you're wrong, and that i can show you. > in particular for mismatched quotes and spaced quotes. > You fix the open quote, or in my case close more often than not, > then that reveals another quote problem further down/up. i have already demonstrated that you can fix spacey quotes, and -- in the vast majority of cases -- fix 'em automatically. leading and trailing spacey quotes are easy to fix, of course.
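in fact, the whole check fits in a few lines. here's a rough python sketch -- straight double quotes only, and the names are mine, a sketch rather than any real tool's code:

    import re

    def fix_spacey_quotes(text):
        # within each paragraph, the 1st, 3rd, 5th... quotemark should be
        # an open quote and the 2nd, 4th... a close quote; a quotemark
        # floating between two spaces gets snugged up by that parity.
        paras = re.split(r"\n[ \t]*\n", text)
        fixed, flagged = [], []
        for i, para in enumerate(paras):
            out, k, j = [], 0, 0
            while j < len(para):
                c = para[j]
                if c == '"':
                    k += 1
                    spacey = (0 < j < len(para) - 1
                              and para[j - 1] == " " and para[j + 1] == " ")
                    if spacey and k % 2 == 1:   # odd = open: drop space after
                        out.append(c)
                        j += 2
                        continue
                    if spacey and k % 2 == 0:   # even = close: drop space before
                        out.pop()               # the space we copied last time
                        out.append(c)
                        j += 1
                        continue
                out.append(c)
                j += 1
            nxt = paras[i + 1].lstrip() if i + 1 < len(paras) else ""
            if k % 2 == 1 and not nxt.startswith('"'):
                flagged.append(i)       # odd count, no continuation: eyeball it
            fixed.append("".join(out))
        return "\n\n".join(fixed), flagged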
from there, it's a simple matter of segmenting the text into _paragraphs_, and counting quotemarks in each paragraph, making sure that the odd ones are open, and the even closed. then when you come upon a spacey quote, fix it to be open if it is an odd one, and fix it to be close if it is an even one. if you come up against a case where there is an odd number of quotes in a paragraph, and the next paragraph does not start with a quote, then you have a case you need to look at. similarly, if any of the quotemarks come up as the wrong type (an odd that's close, or an even that's open), you need to look. you can test this for yourself. you'll find that it's very robust. usually there's no need to spend much time on spacey quotes. > In any event I am not troubled by multiple passes. ok. > Well it *is* a Gutenberg text after all. right. that point wasn't directed at you, as you correctly realized. > Thanks. well, the fact that you haven't wasted your time is only _part_ of the equation. the fact that you won't get much credit down the line (because _your_ text will be discarded because you threw away info that people will want) is yet another (bigger) part of the equation. > Sure. The book *is* public domain after all. Do what you like. i think you missed the point. you can mount a version of your work that doesn't throw away the important information, and then no one will have to re-do it, in which case they will be happy to continue to give you the credit. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Fri Feb 19 21:19:50 2010 From: dakretz at gmail.com (don kretz) Date: Fri, 19 Feb 2010 21:19:50 -0800 Subject: [gutvol-d] Re: Bowerbird's software projects In-Reply-To: <1bbf9.4e21db05.38b0c5a5@aol.com> References: <1bbf9.4e21db05.38b0c5a5@aol.com> Message-ID: <627d59b81002192119g1d7db5eau5c3fe171c424a4d0@mail.gmail.com> I'll concur on the spacey quotes. Twister has a tab just for those. You pick whether to visit all quotes or only anomalies, based on spacing restarting every paragraph. I never bother to visit all any more. It just pops from one bad quote-pair to the next, highlights the whole thing, and offers a button to realign it correctly. If it's just a spacing problem (not missing one end or the other, or some other usage, e.g. inches, dittos, etc.), it always gets it right. This is probably the fastest of all the regex checks that require visual inspection. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hart at pglaf.org Fri Feb 19 21:20:48 2010 From: hart at pglaf.org (Michael S. Hart) Date: Fri, 19 Feb 2010 21:20:48 -0800 (PST) Subject: [gutvol-d] Roundlessness In-Reply-To: <16399.189b0d8d.38b084e6@aol.com> References: <16399.189b0d8d.38b084e6@aol.com> Message-ID: No matter how much I trust anyone, I always make my own "last pass" at any eBook I have the final responsibility for, including quotes, which obviously CAN add up differently after space removal, etc., and there always seems to be at least one "torn margin," etc. It's just nice to have a pair of human eyeballs as the last resort, even when you prepared the entire book all by yourself. I think it may be the case that I ALWAYS found at least one more error even if it is just the most cursory pass. Sometimes I just insist on quite literally working on it until I find ONE more error just to prove I was there doing my two cents worth. Thanks!!!
Michael From gbuchana at teksavvy.com Sat Feb 20 10:52:20 2010 From: gbuchana at teksavvy.com (Gardner Buchanan) Date: Sat, 20 Feb 2010 13:52:20 -0500 Subject: [gutvol-d] Re: so what is so important about pagination? In-Reply-To: <1bbf9.4e21db05.38b0c5a5@aol.com> References: <1bbf9.4e21db05.38b0c5a5@aol.com> Message-ID: <4B802F64.3040909@teksavvy.com> On 19-Feb-2010 23:57, Bowerbird at aol.com wrote: > (because _your_ text will be discarded because you threw away info > that people will want) is yet another (bigger) part of the equation. > > > you can mount a version of your work that doesn't throw away > the important information, and then no one will have to re-do it, > I'm going to take this as a jumping-off point for a more general question about whether the pagination of a published edition is worth saving. Obviously there is a range of opinion. I'll give you mine. What I believe, philosophically, I am shooting for is to capture the core content, and reject the details that have mainly to do with the medium of publication. So at the top level, I think the text itself and notions like block quotations, poetry layout, italics and stuff I keep. Stuff that is a function of the fact it was printed on little rectangles of paper -- hyphenation, page numbering, line ends -- I believe I do not have any use for. Maybe there are possible future uses of my text that would want the things that I left out. I tend to doubt that this could ever be very important. If I take for example what scholarly editions tend to do, they focus on the text, tend to combine information from different printings and editions, and winnow out and reject the artefacts of hyphenation and pagination. They seek out and highlight even small differences in the text, but go to pains to filter out hyphenation artefacts. In the grand scheme of things, there were undoubtedly interesting things in earlier versions of a book than what we have -- the author's manuscript, editors' notes, even the setter's notes all would be very interesting things to have. But if I think what value I could get from having the author's manuscript, I do not picture knowing the pagination or line endings of a longhand manuscript as being of foremost importance. Obviously others feel like preserving page numbers is worthwhile -- I see that most PG-Canada texts have this. As an individual contributor I do not feel that my time is best spent capturing and encoding that, and so I don't. And I am happy that PG finds my efforts acceptable despite this deficiency. I haven't done any sort of real research, but a quick look shows me that not many texts attempt to preserve line endings in any way. Preserving line endings seems quite unpopular. My question to the pagination-preservers is: what is the difference? Hyphenation and line-endings, like pagination, are mainly artefacts of the physical medium -- the former of width, the latter of height. Bowerbird wants to keep both; I see no need to keep either. But what is the reasoning behind keeping one (pagination) and not the other? ============================================================ Gardner Buchanan Ottawa, ON FreeBSD: Where you want to go. Today. From klofstrom at gmail.com Sat Feb 20 11:08:39 2010 From: klofstrom at gmail.com (Karen Lofstrom) Date: Sat, 20 Feb 2010 09:08:39 -1000 Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <4B802F64.3040909@teksavvy.com> References: <1bbf9.4e21db05.38b0c5a5@aol.com> <4B802F64.3040909@teksavvy.com> Message-ID: <1e8e65081002201108k3d84b248pd50b4548c2b95720@mail.gmail.com> On Sat, Feb 20, 2010 at 8:52 AM, Gardner Buchanan wrote: > My question to the pagination-preservers is: what is the difference? Pagination is crucial if you're talking about the text to someone else -- whether in a scholarly context, or just referring to a certain passage when writing a review. If you say, "Nina is called a gypsy on page 89 of the 1899 edition", someone else can find the passage and check your assertion. If you say, "Somewhere in the first third of the book, Nina is called a gypsy," people won't be able to find it. Even if you are reading on a device that does search easily, you'd still have to pull up and scour all mentions of gypsies. Pagination isn't a perfect reference method. If you're in a class where they're reading Gaskell's North and South, say, and the teacher is referring to a modern reprint and you've got an ebook version of the first edition, with the first edition pagination, you're going to have to do some searching. You'll probably find what you want within a range of a few pages, however. The best method is the one used for religious texts: giving chapter and verse. That reference is invariant across all versions. Perhaps we'll adopt that eventually for ALL texts. Until then, pagination is a next best. -- Karen Lofstrom From Bowerbird at aol.com Sat Feb 20 12:11:31 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 20 Feb 2010 15:11:31 EST Subject: [gutvol-d] Re: so what is so important about pagination? Message-ID: <28809.577ebb12.38b19bf3@aol.com> gardner said: > I'm going to take this as a jumping-off point > for a more general question about > whether the pagination of a published edition > is worth saving. Obviously there is a range of opinion. > I'll give you mine. yes, there is a range of opinion. i can give arguments -- even good arguments -- on both sides. which obviously means that _some_ people have good arguments for retaining pagination. and you disenfranchise those people entirely when you throw out the pagination, no matter how good your intentions might be for doing so. i'd rather not disenfranchise those people... so i think it's necessary to include the pagination, and the original linebreaks, with end-line hyphenates. now, i think it's _imperative_ that we give people tools that enable them to discard that pagination, and unwrap those original linebreaks, and rejoin end-line hyphenates. to do otherwise would be to disenfranchise _those_ people, and i'd rather not disenfranchise them either. so, for me, the answer to the question is extremely simple. > What I believe, philosophically, I am shooting for > is to capture the core content, and reject the details > that have mainly to do with the medium of publication. i can understand that perspective. i can also understand the other perspective. and i see no reason that anyone has to be unhappy here. it's very important to understand that this does _not_ have to be an either/or question. we _can_ do _both_... > Maybe there are possible future uses of my text > that would want the things that I left out. > I tend to doubt that this could ever be very important. well, then, your imagination is starting to lag behind...
:+) because we are now right in the middle of a situation here where "the things that you left out" _are_ "very important", namely a reproofing of your book, to test your accuracy... it's an order of magnitude more difficult to proof a book when the text has lost all of its linebreaks and pagination. are you of the opinion that the future will simply _accept_ that you did a perfect job in the digitization of your books? or do you think they will want to verify the quality of them? if you make it too difficult for them to undertake that job, they will just toss out your text and start anew. your loss... > As an individual contributor I do not feel that my time > is best spent capturing and encoding that, and so I don't. except the info was already captured. then you threw it away. > And I am happy that PG finds my efforts acceptable > despite this deficiency. except the future will throw out all the d.p. works because your deficiency is shared by the entire d.p. corpus, sadly... (even the d.p. people who save pagination toss linebreaks.) > I haven't done any sort of real research, but > a quick look shows me that > not many texts attempt to preserve line endings in any way. > Preserving line endings seems quite unpopular. the future needs to future-proof tens of millions of books... they can afford to throw out everything done up to this point, if they feel they need to, and they will, they most certainly will. (and advances in o.c.r. and o.c.r. correction will make it easy.) (well, as i pointed out, they might use some of the current texts to proof the new o.c.r. that they do, but then they'll toss them.) > Bowerbird wants to keep both actually, i don't need to take a personal stand on the issue, not as an end-user. and that's a good thing, because often i don't have a need for the original linebreaks or pagination. so i definitely want the option of discarding that information. what i am saying is that, as a "best practice" for digitization, the discarding of such information is clearly a terrible mistake. and if you're doing it simply because _you_ don't see a need for that information, then you're being selfish and shortsighted. plus you're giving an ultimatum to people who want that info: they're forced to toss your text as failing to meet their needs. and i am positive that you're going to lose that bet, gardner... -bowerbird p.s. by the way, i found 23 discrepancies in the paragraphing between your version of "the advocate" and archive.org's o.c.r. however, 21 were errors in the o.c.r., and only 2 in your book. the two errors in your book, both of them missed paragraphs: > http://z-m-l.com/go/gardn/gardnp087.html > "What, take the bird back to the bush where we > http://z-m-l.com/go/gardn/gardnp101.html > Yet who shall blame the sun and moon for that? -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sat Feb 20 13:24:03 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 20 Feb 2010 16:24:03 EST Subject: [gutvol-d] Re: Bowerbird's software projects Message-ID: <29dd8.5294e4e8.38b1acf3@aol.com> i've uploaded the archive.org o.c.r.
for "the advocate" -- the book that gardner suggested that i look at -- into the skeleton that was previously being used, at: > http://z-m-l.com/go/gardn/gardnp123.html the last page of the book shows the changes i made: > http://z-m-l.com/go/gardn/gardnp126.html the .zml that made these .html pages, as usual, is at: > http://z-m-l.com/go/gardn/gardn.zml if you compare the .zml file to the original o.c.r., you can see that it is very similar. it doesn't take much work to massage typical o.c.r. output to .zml. as no proofing has been done yet, the text is raw... (although this book came from the internet archive, it is a copy of a google book, which means the o.c.r. is very shoddy, since google puts out low-res scans. when archive.org does o.c.r. on its own scan-sets, the o.c.r. is fairly good, since they're using abbyy.) ordinarily at this point, in order to clean the o.c.r., i'd restore the linebreaks to gardner's p.g. e-text by using the linebreaks from the o.c.r. as a guide... however, gardner sent me a copy of the text as it was _before_ he rewrapped the original linebreaks, so i won't need to go through that boring exercise. i decided to post this o.c.r. text anyway, just so you could see what it would look like as it is "in process". at this point, the structure of the thing is pretty solid, in the sense that all of the paragraphing is correct, and the chapter-heads are in place, and all of that, it's just the scanning errors make the thing awful... but if you look past those scanning errors, you can see why this version is superior to the p.g. e-text: it is obviously self-validating against the p-book from which the text claims to have been generated, since it is easily compared to scans of that p-book. (or even against an actual hard-copy, if you prefer.) even if the p.g. text _is_ accurate, it can't be _verified_ as accurate, not quickly and easily, like this text can... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Sat Feb 20 13:53:23 2010 From: jimad at msn.com (Jim Adcock) Date: Sat, 20 Feb 2010 13:53:23 -0800 Subject: [gutvol-d] Re: DP: was rfrank reports in In-Reply-To: <1e8e65081002121127r77da6414y32a6f4b00f35d6fc@mail.gmail.com> References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <1e8e65081002121127r77da6414y32a6f4b00f35d6fc@mail.gmail.com> Message-ID: >There's no forcing going on. The policy from Day One has been that we work on what the content providers submit. Sometimes works that look enticing or valuable to them aren't appealing to the proofers, and then take a long time to wend their way through the system. (Some texts, like Greg Week's science fiction stories, zip through in days.) Works that CP'ers submit which are stuck on the queues AREN'T being worked on. People who volunteer for DP are forced to work on things not stuck on queues. That IS the forcing going on. Work that progresses slowly through the Proofing rounds aren't really the problem. The problem is more works that get stuck in the formatting rounds and the PP rounds. What I've seen stuck in the proofing rounds has sections, such as huge sections of publishers ads, or indexes, which most Proofers get tired of pretty quick -- especially when the work is classified as "Easy." 
I would question the judgment of including publishers' ads when they aren't even numbered pages nor relate to the subject matter. Let's try to break this down again in a way that SHOULDN'T be controversial: 0) Premise: DP people ARE acknowledging that having books stuck on queues 3.5 years is not a good thing. If this is NOT a good thing, then SOMETHING has to change. If one wants to change the queuing times there are really ONLY a couple of things fundamentally that one can change: 1) You can reduce the rate at which content is placed onto the queues. That implies SOME kind of principle of selection. The principle right now is "First Come First Serve." I suggest this is not a good thing for several reasons: Books may be put on the queue that people really don't want to work on. Books may be put on the queue that people really don't want to read. And books may be put on the queue that take time and energy disproportionate to the societal benefit to be gained from that book compared to some other books. Note there are about 50 million books available worldwide that could be worked on by DP, compared to roughly 2500 a year created by DP, implying a queuing time for books in general of 20,000 years (50,000,000 books / 2,500 books a year = 20,000 years) -- not including those books that will have risen to the public domain in those 20,000 years! Another way of saying this is that the selection process used to decide which books get "rescued" by DP is such that on the order of 1 book in 10,000 gets saved. Now, if only one book in 10,000 gets saved, should this be "at random" or should there be some kind of selection process -- even if it were only that the DP volunteers who are going to do the work vote on what gets put on the queue? 2) You can increase the rate at which content is taken off the queues. This requires placing more resources at those places in the queues where things are getting bogged down, which are P3, F2, and PP. To place more resources at these places requires at least SOME tweaking of DP's current system of "technological high priesthood" and would require getting over DP's current idea that somehow they are creating "perfect books" [which they certainly are NOT doing!] 3) You can increase productivity by improving tools -- particularly tools helping P3, F2, and PP. Producing tools that help P1 is pretty easy, as many people have suggested, but it is actually NOT obvious that improving tools for P1 would prove to be helpful to DP overall! Making P1 faster and easier without changing the current rules of "technological high priesthood" will actually only make the queuing problems more extreme. From jimad at msn.com Sat Feb 20 14:00:23 2010 From: jimad at msn.com (Jim Adcock) Date: Sat, 20 Feb 2010 14:00:23 -0800 Subject: [gutvol-d] Re: DP: was rfrank reports in In-Reply-To: <18CC2C23FCF249DEA672595196E236B2@alp2400> References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <18CC2C23FCF249DEA672595196E236B2@alp2400> Message-ID: >"revised" version is going to appear in a few days/weeks/months... We are not talking about a "revised" version showing up in a couple months. Rather, we are talking about doing a posting which includes HTML 3.5 years later. Versus posting the txt version now rather than later, and thereby increasing the total collection size of PG by 20%.
One could argue that this would make the whitewashers' job easier rather than harder -- because then Al wouldn't have to put up with random submissions from people like me who give up on DP and "route around damage" [thereby introducing "damage" of our own! :-] From greg at durendal.org Sat Feb 20 14:57:24 2010 From: greg at durendal.org (Greg Weeks) Date: Sat, 20 Feb 2010 17:57:24 -0500 (EST) Subject: [gutvol-d] [SPAM] Re: Re: DP: was rfrank reports in In-Reply-To: References: <8005.73d837a3.38a06575@aol.com> <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400> <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com> <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com> <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com> <18CC2C23FCF249DEA672595196E236B2@alp2400> Message-ID: At least one of the discussions going on was exactly the HTML-coming-a-few-weeks-after-the-text scenario. This was the one where the project goes through all the rounds and the PPer posts the text version as soon as it's done and posts the html later. This didn't seem like a terribly useful approach to me as the html version of the text is typically NOT where the bottleneck is at DP. Of course there were at least five other approaches being discussed in the thread. Greg Weeks On Sat, 20 Feb 2010, Jim Adcock wrote: >> "revised" version is going to appear in a few days/weeks/months... > > We are not talking about a "revised" version showing up in a couple months. > Rather, we are talking about doing a posting which includes HTML 3.5 years > later. Versus posting the txt version now rather than later, and thereby > increasing the total collection size of PG by 20%. One could argue that > this would make the whitewashers' job easier rather than harder -- because > then Al wouldn't have to put up with random submissions from people like me > who give up on DP and "route around damage" [thereby introducing "damage" of > our own! :-] > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -- Greg Weeks http://durendal.org:8080/greg/ From jimad at msn.com Sat Feb 20 15:13:20 2010 From: jimad at msn.com (Jim Adcock) Date: Sat, 20 Feb 2010 15:13:20 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <4B76F4C6.3030006@teksavvy.com> References: <4B76F4C6.3030006@teksavvy.com> Message-ID: I do "solos" given my frustration level with DP -- where I've submitted two really good books but neither has made it back out of the system. IMHO setting up a book to go through the DP system aka Content Providing isn't a whole lot less work than just doing the whole book for myself in the first place. Not entirely happy working with myself either -- going it alone is a bit of a slog for me -- but my tolerance level for wasting time is about one month -- which is about how long it takes me to make a book while working around various family emergencies -- as compared to 40 months for DP. And with DP nothing happens for months or years at a time -- and then the people there are unhappy with you if you happen to be out of town if and when your book pops off a queue and "goes active". What I wish is that DP had a "Fast Trackers" division of people interested in and committed to turning books out quickly, so that one could see a project from beginning to end. I still proof at DP occasionally when I have excess energy -- but not enough to start my own new book project again!
From jimad at msn.com Sat Feb 20 15:48:17 2010 From: jimad at msn.com (Jim Adcock) Date: Sat, 20 Feb 2010 15:48:17 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <9F120957CF48439F9C63FD74DE1B25F7@alp2400> References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com> <9F120957CF48439F9C63FD74DE1B25F7@alp2400> Message-ID: >.... but there's currently no mechanism except for the Whitewashers, a.k.a. Errata Team, to fix this kind of thing. (Probably simpler to just re-do this text from scratch, which is something *I'm* not about to do.) OK, HOW ABOUT a mechanism for fixing and/or improving things that were done in the past that now look old and crufty by today's standards? -- whether redoing something originally created by DP or by a solo? Certainly WW shouldn't be the only way to fix old cruft. If someone wants to take on a "redo and improve" what does it take? Many of the things that actually get read at PG are pretty old and crufty! -- I haven't been willing to take on any of the Ye Olde Cruft for fear of pushback. From jimad at msn.com Sat Feb 20 15:50:30 2010 From: jimad at msn.com (Jim Adcock) Date: Sat, 20 Feb 2010 15:50:30 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: <9F120957CF48439F9C63FD74DE1B25F7@alp2400> References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com> <9F120957CF48439F9C63FD74DE1B25F7@alp2400> Message-ID: >In short, DP's current processes produce error-free texts.... I will disagree with this, at least given that DP's current processes introduce punc errors pretty much by design. From greg at durendal.org Sat Feb 20 16:06:07 2010 From: greg at durendal.org (Greg Weeks) Date: Sat, 20 Feb 2010 19:06:07 -0500 (EST) Subject: [gutvol-d] [SPAM] Re: Re: Many solo projects out there in gutvol-d land? In-Reply-To: References: <4B76F4C6.3030006@teksavvy.com> Message-ID: On Sat, 20 Feb 2010, Jim Adcock wrote: > What I wish is that DP had a "Fast Trackers" division of people interested > in and committed to turning books out quickly, so that one could see a > project from beginning to end. Don (dkretz) and I and a small team experimented with this a few weeks ago. It's entirely possible to do this within the current DP constraints. I think we took about two weeks for the short we used. That wasn't the main purpose of the experiment, but the short period of time was one of the constraints for what we want to test. -- Greg Weeks http://durendal.org:8080/greg/ From hart at pglaf.org Sat Feb 20 16:33:08 2010 From: hart at pglaf.org (Michael S. Hart) Date: Sat, 20 Feb 2010 16:33:08 -0800 (PST) Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? 
In-Reply-To: References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com> <9F120957CF48439F9C63FD74DE1B25F7@alp2400> Message-ID: Let's just forget the whole idea of error free texts. . . . Ever since I started Project Gutenberg I've never seen even one book I read, even most articles and essays, without big blunders you would think could never be published. I would prefer just to get these materials in circulation-- then worry about approaching perfection along with Zeno. Does anybody have a serious objection to putting the 8,000, or so, books that were listed earlier as being in limbo, in something like our "PrePrints" section, where we put eBooks that are admittedly not ready for prime time??? Please. . . . Michael On Sat, 20 Feb 2010, Jim Adcock wrote: > >In short, DP's current processes produce error-free texts.... > > I will disagree with this, at least given that DP's current processes > introduce punc errors pretty much by design. > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From jimad at msn.com Sat Feb 20 16:43:55 2010 From: jimad at msn.com (Jim Adcock) Date: Sat, 20 Feb 2010 16:43:55 -0800 Subject: [gutvol-d] Kindle for Blackberry In-Reply-To: <1cd29.10b0e871.38a990fb@aol.com> References: <1cd29.10b0e871.38a990fb@aol.com> Message-ID: Amazon has released "Kindle for Blackberry" for free at: http://www.amazon.com/gp/feature.html/ref=klm_lnd_inst?docId=1000468551 I don't personally own a Blackberry, so I can't report on this one in specific. Typically would allow one to read "for pay" Amazon books plus free public domain books including PG in MOBI format. Why would one care? i) Yet another "free reader" software for cell phone devices. ii) Good for people interested in making MOBI versions of PG books -- or checking out what their DP efforts look like once translated by PG to MOBI format and from there to people's cell phones. iii) check out "the competition." From greg at durendal.org Sat Feb 20 16:45:39 2010 From: greg at durendal.org (Greg Weeks) Date: Sat, 20 Feb 2010 19:45:39 -0500 (EST) Subject: [gutvol-d] [SPAM] Re: Re: Many solo projects out there in gutvol-d land? In-Reply-To: References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com> <9F120957CF48439F9C63FD74DE1B25F7@alp2400> Message-ID: On Sat, 20 Feb 2010, Michael S. Hart wrote: > Does anybody have a serious objection to putting the 8,000, > or so, books that were listed earlier as being in limbo, in > something like our "PrePrints" section, where we put eBooks > that are admittedly not ready for prime time??? Yea, there are people arguing that it's a horrible thing to do. I'm 100% with you on this. Available with a few errors is far more useful than unavailable. And it's not that they aren't actually available now, they are. DP has always had the concatenated text available for download.
It's behind a sign on and not indexed by any of the search engines, so if you don't know it's there already you can't find it. -- Greg Weeks http://durendal.org:8080/greg/ From ajhaines at shaw.ca Sat Feb 20 17:08:18 2010 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sat, 20 Feb 2010 17:08:18 -0800 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com> <9F120957CF48439F9C63FD74DE1B25F7@alp2400> Message-ID: <3EBEA039872B469FBC6A629378263ACB@alp2400> Any "mechanism" is informal, at best, and there's no list of old submissions that would benefit from being re-done. To use as an example, Arizona Sketches, by J. A. Munk, PG#756. Internet Archive has a number of source copies. In 2008, I cleaned up PG's text file, made corrections, and created an HTML version. It's missing all illustrations, any Latin1 characters, and so forth. If the only intent is to correct a current PG etext, the corrected text and HTML files can be sent to PG's Errata system. Do not reformat the files, so that the corrected ones can be compared to the posted ones. It might take a few days for the WWers to deal with such submissions, but they *will* be dealt with. However, if you want to add illustrations, or any other material that may be missing from the posted files, you'll have to submit a copyright clearance for the source edition, do whatever is needed to add the missing material to the posted files, do a thorough check/correction of those files from the source, then upload everything as normal, mentioning in the Note to Whitewashers field that the submission is intended as an update to an existing etext. The WWers will decide whether to post the new submission as a new etext, or to replace (and archive) the existing files. If the latter is chosen, the original submitter's credit will be added to the new version's Credit line. ----- Original Message ----- From: "Jim Adcock" To: "'Project Gutenberg Volunteer Discussion'" Sent: Saturday, February 20, 2010 3:48 PM Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? > >.... but there's > currently no mechanism except for the Whitewashers, a.k.a. Errata Team, to > fix this kind of thing. (Probably simpler to just re-do this text from > scratch, which is something *I'm* not about to do.) > > OK, HOW ABOUT a mechanism for fixing and/or improving things that were > done > in the past that now look old and crufty by today's standards? -- whether > redoing something originally created by DP or by a solo? Certainly WW > shouldn't be the only way to fix old cruft. If someone wants to take on a > "redo and improve" what does it take? Many of the things that actually get > read at PG are pretty old and crufty! -- I haven't been willing to take on > any of the Ye Olde Cruft for fear of pushback. > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From jimad at msn.com Sat Feb 20 17:35:39 2010 From: jimad at msn.com (Jim Adcock) Date: Sat, 20 Feb 2010 17:35:39 -0800 Subject: [gutvol-d] Re: so what is so important about pagination? 
In-Reply-To: <28809.577ebb12.38b19bf3@aol.com> References: <28809.577ebb12.38b19bf3@aol.com> Message-ID: Pagination is not necessarily the same thing as page numbers. I like retaining some notion of page numbers even if it is just in the form of invisible or semi-invisible HTML. I also like retaining original linebreaks info to assist future proofing or reworking passes -- which again is not the same as displaying original linebreaks. I dislike anything that prevents reflow, which I think is necessary for the enjoyment of most users. From Bowerbird at aol.com Sat Feb 20 17:52:13 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 20 Feb 2010 20:52:13 EST Subject: [gutvol-d] [SPAM] re: Many solo projects out there in gutvol-d land? Message-ID: <2ecd9.15b98434.38b1ebcd@aol.com> michael said: > Let's just forget the whole idea of error free texts. . . . well, that's going a bit overboard. better, let's try to achieve perfection, but let's not let that high goal get in the way of making work available even if it's not yet perfect... > I would prefer just to get these materials in circulation-- > then worry about approaching perfection along with Zeno. yes, except i don't see any part of project gutenberg that is doing very much at all in the way of "approaching perfection". once a text is posted, people seem to forget it completely... even when the whitewashers "correct" an e-text, they aren't doing nearly what they could be doing in order to improve it. it seems like there is a constant mad dash to do new books, but almost nothing is being done to fix up any older books. i asked at 10,000 books for a review in terms of quality control, and again at 15,000, and again at 20,000, and again at 25,000. i didn't bother to ask again at 30,000, because what's the use? but at some point, some hard questions will need to be asked... > Does anybody have a serious objection to putting the 8,000, > or so, books that were listed earlier as being in limbo, in > something like our "PrePrints" section, where we put eBooks > that are admittedly not ready for prime time??? well, i certainly don't... but many of the volunteers over at distributed proofreaders do. indeed, according to a poll (which has now received one of the highest number of votes on any poll that has been done there), they are split evenly -- right down the middle -- on this issue. i don't know what to make of that. but that's the way it is. it's also worth mentioning that those 8,000 books are _not_ "almost done". some of them really aren't even very close... some are full of typos, still. most contain pseudo-markup, which really should be converted to something more useful before the books are ever put in front of the general public. lots contain "proofer's notes", which would confuse people. it should also be noted that many of them are not in english, which might (or might not) have bearing on the question, but since i only speak english, i wouldn't have any idea what it is. considering all this, it would _not_ be a simple procedure to free up this matter. it could use up a lot of time and energy, and for very little benefit in return. (does anyone use "preprints"?) what _would_ be useful is for this material to be put on a wiki, in order to test notions of public postprocessing collaboration. instead of saying "here, take this unfinished work", we _should_ instead be saying "here, come help finish this unfinished work". at one point, i was tempted to build such a postprocessing system.
but then i realized i didn't want to help d.p. get over their backlog; d.p. deserves to suffer the consequences of their terrible workflow, or they'll _never_ be motivated to fix it... so i decided to let it be... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sat Feb 20 17:54:56 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 20 Feb 2010 20:54:56 EST Subject: [gutvol-d] [SPAM] re: Re: so what is so important about pagination? Message-ID: <2ed73.402f31ef.38b1ec70@aol.com> jim said: > I dislike anything that prevents reflow, > which I think is necessary for the enjoyment of most users. there is absolutely nothing about retaining pagination (or linebreaks, or end-line hyphenates) that "prevents reflow", jim, and i wish you would stop repeating that nonsense. you need to pay better attention. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sat Feb 20 18:15:02 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 20 Feb 2010 21:15:02 EST Subject: [gutvol-d] Re: ok, let's take a look at gardner's book, just for the exercise Message-ID: <2f260.45cb45b8.38b1f126@aol.com> i've merged gardner's corrections into the archive.org o.c.r., and posted the results on my website. here's a sample url: > http://z-m-l.com/go/gardn/gardnp123.html gardner dehyphenated end-line hyphenates, so i rejoined (some of) them. i'll write another routine to do the rest... out of a file that contains about 4,000 lines, there are only 800 (at this point) which differ in the two versions. so even early in the merge, 80% of the o.c.r. lines were right, and that number will increase with more aggressive cleaning. the (presumably incorrect) lines from the o.c.r. are in red, while the lines from gardner's proofed copy are in blue... if you prefer to view all of the edits on one web-page, see: > http://z-m-l.com/go/gardn/gardn-hybrid.html -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Sat Feb 20 19:15:31 2010 From: jimad at msn.com (James Adcock) Date: Sat, 20 Feb 2010 19:15:31 -0800 Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about pagination? In-Reply-To: <2ed73.402f31ef.38b1ec70@aol.com> References: <2ed73.402f31ef.38b1ec70@aol.com> Message-ID: >there is absolutely nothing about retaining pagination (or linebreaks, or end-line hyphenates) that "prevents reflow", jim, and i wish you would stop repeating that nonsense. As always, we talk past each other - I talk about problems in the real world, and Bowerbird responds with hypotheticals from bowerbirdworld. Certainly current PG choice of linebreaks IS preventing real world customers from reading PG books on their choice of hardware. I know because I have responded to their complaints about PG brokenness on other forums. Real world customers just want to read books, they don't want to have to route around PG damage. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Sat Feb 20 19:48:05 2010 From: dakretz at gmail.com (don kretz) Date: Sat, 20 Feb 2010 19:48:05 -0800 Subject: [gutvol-d] Re: [SPAM] re: Many solo projects out there in gutvol-d land?
In-Reply-To: <2ecd9.15b98434.38b1ebcd@aol.com> References: <2ecd9.15b98434.38b1ebcd@aol.com> Message-ID: <627d59b81002201948n196a7757g620d8c1b8306550d@mail.gmail.com> I think it would certainly get their attention if they were told that Michael S. Hart would prefer that they focus on doing whatever it reasonably takes to remove the notes, standardize the markup, and get them posted. In fact, I might just post something to that effect myself. Let's see ... it's "Ctl-C" here, switch to DP, log in, "Ctl-V", .... On Sat, Feb 20, 2010 at 5:52 PM, wrote: > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Sat Feb 20 21:14:21 2010 From: dakretz at gmail.com (don kretz) Date: Sat, 20 Feb 2010 21:14:21 -0800 Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about pagination? In-Reply-To: References: <2ed73.402f31ef.38b1ec70@aol.com> Message-ID: <627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com> It's not trivial that it would make shared proofing a lot easier and less ambiguous. Just match the image. -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Sun Feb 21 06:35:34 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Sun, 21 Feb 2010 15:35:34 +0100 Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about pagination? In-Reply-To: <627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com> References: <2ed73.402f31ef.38b1ec70@aol.com> <627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com> Message-ID: Hi Don, You are right about this. But, the people over at DP want more. Which is also fine. The problem is that over at DP they seem to me too focused on output formatting. What they do not seem to understand is that you can use markup as pseudo-code, or pseudo-markup for that matter. They want to keep as much information as possible. That is not hard if you use pseudo-code. The first step would be, as you said, to match the scanned image. So you have a text containing a lot of code marking the original linebreaks, chapter beginnings, page marks, page numbers, bold, italics, indentation, images. This markup will not be easily human-readable, but computers do a good job of rendering/displaying it in an appropriate fashion. Then all you need is a simple tool that parses this format into the output format you want. E.g. for plain text:

  throw out page breaks, images, hyphenation
  convert footnotes to PG style
  convert bold, italics to PG style
  start output PG style
  output PG header
  output text PG style, two linebreaks for paragraphs, before chapters, etc.
  wrap accordingly

This is another simplification. For HTML (everything in one page):

  throw out hyphenation
  create tags for bold, italic
  create tags for chapter headers, with anchors
  create tags for paragraphs, respecting indentation for verse and such
  throw out linebreaks
  create footnotes, with anchors
  create tags for images
  create TOCs

You could also have the system produce a more complex HTML structure, directories for chapters, one file per page, etc. The same procedure can be applied to other output formats. That is the cool thing about pseudo-code: it does not produce output if you do not want it or need it!! (A toy sketch of such a tool appears below.) regards Keith. On 21.02.2010 at 06:14, don kretz wrote: > It's not trivial that it would make shared proofing a lot easier and less ambiguous. > > Just match the image.
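Keith's recipe above is easy to make concrete. Here is a toy sketch in python; the tag names (\page, \italic, \bold) are invented for the example, only non-nested tags are handled, and a real tool would of course want a proper parser:

    import re

    # toy pseudo-markup: \page{5}, \italic{...}, \bold{...};
    # a blank line separates paragraphs. nesting is not handled.
    TAG = re.compile(r'\\(\w+)\{([^{}]*)\}')

    def render(text, fmt):
        """render pseudo-marked text as 'txt' (PG style) or 'html'."""
        def repl(m):
            tag, body = m.group(1), m.group(2)
            if tag == "page":               # page info, not necessarily output
                return "" if fmt == "txt" else '<a id="page%s"></a>' % body
            if tag in ("italic", "bold"):
                if fmt == "txt":
                    return "_%s_" % body    # PG-style emphasis markers
                t = "i" if tag == "italic" else "b"
                return "<%s>%s</%s>" % (t, body, t)
            return body                     # unknown tag: keep the bare text
        paras = [TAG.sub(repl, p).strip() for p in text.split("\n\n")]
        if fmt == "txt":
            return "\n\n".join(p for p in paras if p)
        return "\n".join("<p>%s</p>" % p for p in paras if p)

The same source then feeds both outputs -- render(src, "txt") for a plain-text posting, render(src, "html") for the web version -- and each output format decides for itself what information to throw out.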
> > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d From dakretz at gmail.com Sun Feb 21 08:46:31 2010 From: dakretz at gmail.com (don kretz) Date: Sun, 21 Feb 2010 08:46:31 -0800 Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about pagination? In-Reply-To: References: <2ed73.402f31ef.38b1ec70@aol.com> <627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com> Message-ID: <627d59b81002210846l53ecedeeua8ede575c3ad4030@mail.gmail.com> Keith, I agree 100%. I've been arguing markdown and textile - even zml - for years. Don -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Sun Feb 21 09:04:54 2010 From: dakretz at gmail.com (don kretz) Date: Sun, 21 Feb 2010 09:04:54 -0800 Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about pagination? In-Reply-To: <627d59b81002210846l53ecedeeua8ede575c3ad4030@mail.gmail.com> References: <2ed73.402f31ef.38b1ec70@aol.com> <627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com> <627d59b81002210846l53ecedeeua8ede575c3ad4030@mail.gmail.com> Message-ID: <627d59b81002210904q3d657e74i458f4fd44ccd06b7@mail.gmail.com> ReStructuredText is a newer one that seems to be particularly extensible (hence expressive and adaptable). On Sun, Feb 21, 2010 at 8:46 AM, don kretz wrote: > Keith, I agree 100%. I've been arguing markdown and textile - even zml - for > years. > > > Don > -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Sun Feb 21 10:32:00 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Sun, 21 Feb 2010 19:32:00 +0100 Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about pagination? In-Reply-To: <627d59b81002210904q3d657e74i458f4fd44ccd06b7@mail.gmail.com> References: <2ed73.402f31ef.38b1ec70@aol.com> <627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com> <627d59b81002210846l53ecedeeua8ede575c3ad4030@mail.gmail.com> <627d59b81002210904q3d657e74i458f4fd44ccd06b7@mail.gmail.com> Message-ID: <736D3990-549A-4D41-9498-0586A4881B85@uni-trier.de> Hi Don, I am not talking markdown or Restructured. I am talking about a true markup language. As a meta-language, XML or TeX can be used. The idea is to have tags which contain information that is not truly formatting. E.g., a pagenumber tag just states that this page is page number n; it could be integrated into a page break like \page{5}. You could have a footer of a page that looks like this: \footer{\right{\bold{page} \italic{5}}} This footer contains the number of the page, but does not have anything to do with the page or pagenumber tag. regards Keith. Am 21.02.2010 um 18:04 schrieb don kretz: > ReStructuredText is a newer one that seems to be particularly extensible > (hence expressive and adaptable). > > On Sun, Feb 21, 2010 at 8:46 AM, don kretz wrote: > Keith, I agree 100%. I've been arguing markdown and textile - even zml - for years. > > > Don > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Sun Feb 21 10:51:36 2010 From: marcello at perathoner.de (Marcello Perathoner) Date: Sun, 21 Feb 2010 19:51:36 +0100 Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about pagination?
In-Reply-To: <627d59b81002210904q3d657e74i458f4fd44ccd06b7@mail.gmail.com> References: <2ed73.402f31ef.38b1ec70@aol.com> <627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com> <627d59b81002210846l53ecedeeua8ede575c3ad4030@mail.gmail.com> <627d59b81002210904q3d657e74i458f4fd44ccd06b7@mail.gmail.com> Message-ID: <4B8180B8.4070305@perathoner.de> don kretz wrote: > ReStructuredText > is > a newer one that seems to be particularly extensible (hence > expressive and adaptable). This is an example of RST that EpubMaker (the converter that does all PG epubs) can convert to an industrial-strength epub:

.. -*- encoding: utf-8 -*-

.. meta::
   :DC.Creator: Raymond Chandler
   :DC.Title: The Big Sleep
   :DC.Language: English
   :DC.Created: 1939

The Big Sleep by Raymond Chandler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. contents:: Contents
   :backlinks: entry

Chapter 1
=========

It was about eleven o'clock in the morning, mid October, with the sun not shining and a look of hard wet rain in the clearness of the foothills. I was wearing my powder-blue suit, with dark blue shirt, tie and display handkerchief, black brogues, black wool socks with dark blue clocks on them. I was neat, clean, shaved and sober, and I didn't care who knew it. I was everything the well-dressed private detective ought to be. I was calling on four million dollars.

[...]

Chapter 2
=========

[...]

-- Marcello Perathoner webmaster at gutenberg.org From gbnewby at pglaf.org Sun Feb 21 11:33:09 2010 From: gbnewby at pglaf.org (Greg Newby) Date: Sun, 21 Feb 2010 11:33:09 -0800 Subject: [gutvol-d] Mirroring the firehost? Re: Re: Many solo projects out there in gutvol-d land? Message-ID: <20100221193308.GE10824@pglaf.org> On Sat, 20 Feb 2010, Michael S. Hart wrote: > > > Does anybody have a serious objection to putting the 8,000, > > or so, books that were listed earlier as being in limbo, in > > something like our "PrePrints" section, where we put eBooks > > that are admittedly not ready for prime time??? > > Yea, there are people arguing that it's a horrible thing to do. I'm 100% with > you on this. Available with a few errors is far more useful than unavailable. > And it's not that they aren't actually available now, they are. DP has always > had the concatenated text available for download. It's behind a sign on and not > indexed by any of the search engines, so if you don't know it's there already > you can't find it. What's the URL? I could set up a nightly mirror... Do they automatically disappear from this area, after they are finally published? -- Greg From greg at durendal.org Sun Feb 21 11:48:38 2010 From: greg at durendal.org (Greg Weeks) Date: Sun, 21 Feb 2010 14:48:38 -0500 (EST) Subject: [gutvol-d] [SPAM] Re: Mirroring the firehost? Re: Re: Many solo projects out there in gutvol-d land? In-Reply-To: <20100221193308.GE10824@pglaf.org> References: <20100221193308.GE10824@pglaf.org> Message-ID: On Sun, 21 Feb 2010, Greg Newby wrote: >> had the concatenated text available for download. It's behind a sign on and not >> indexed by any of the search engines, so if you don't know it's there already >> you can't find it. > What's the URL? I could set up a nightly mirror... > > Do they automatically disappear from this area, after they > are finally published? There's not a single place; you have to walk the project lists using the search function. They do eventually disappear, but the status changes to posted when they are posted to PG. Do you have a sign on at DP?
If so try: http://www.pgdp.net/c/tools/project_manager/projectmgr.php?show=search&title=&author=&language[]=&special_day[]=&projectid=&project_manager=&checkedoutby=&pp_er=&ppv_er=&postednum=&state[]=P3.proj_waiting&n_results_per_page=100 That's everything in the P3 waiting queue. If you pick one from that list (I'm going to grab one of mine): http://www.pgdp.net/c/project.php?id=projectID4b5e3e5a9b845&detail_level=3 There's a link titled "Download Concatenated Text" with a download button that will download a zip with the text from the last proofing round. The two queues that are most interesting, because they are the largest, are the P3 waiting and F2 waiting. -- Greg Weeks http://durendal.org:8080/greg/ From dakretz at gmail.com Sun Feb 21 12:25:36 2010 From: dakretz at gmail.com (don kretz) Date: Sun, 21 Feb 2010 12:25:36 -0800 Subject: [gutvol-d] Re: [SPAM] Re: Mirroring the firehost? Re: Re: Many solo projects out there in gutvol-d land? In-Reply-To: References: <20100221193308.GE10824@pglaf.org> Message-ID: <627d59b81002211225y4109839bx90dd3179d3190109@mail.gmail.com> Here's the HTML form with the GET variables that comprise the URL.
Round selector: [OCR] P1 P2 (the form widgets themselves were scrubbed from the archive)
For each page, use:
- the text (if any) saved in the selected round; or
- the latest text saved in any round up to and including the selected round.
(If every page has been saved in the selected round, then the two choices are equivalent.)
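A rough sketch of scripting that download in Python, assuming a session cookie copied out of a browser after signing on; the cookie name (PHPSESSID) and the shape of the download link are guesses, since none of this is a documented API:

import re
import requests

SITE = "http://www.pgdp.net"
# Assumption: DP is PHP-based, so the login session rides in a PHPSESSID
# cookie; copy yours from the browser after signing on.
COOKIES = {"PHPSESSID": "paste-your-session-id-here"}

def fetch_concatenated_text(projectid):
    page = requests.get(SITE + "/c/project.php",
                        params={"id": projectid, "detail_level": 3},
                        cookies=COOKIES)
    page.raise_for_status()
    # Guess: find the link target near the "Download Concatenated Text" label.
    m = re.search(r'(?i)href="([^"]+)"[^>]*>[^<]*concatenated text', page.text)
    if m is None:
        raise RuntimeError("no concatenated-text link found -- not signed on?")
    url = m.group(1)
    if not url.startswith("http"):
        url = SITE + "/c/" + url.lstrip("/")
    data = requests.get(url, cookies=COOKIES)
    data.raise_for_status()
    with open(projectid + ".zip", "wb") as f:
        f.write(data.content)

fetch_concatenated_text("projectID4b5e3e5a9b845")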
All you need then is a list of the project codes. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sun Feb 21 12:47:34 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 21 Feb 2010 15:47:34 EST Subject: [gutvol-d] Re: so what is so important about pagination? Message-ID: keith said: > I am not talking markdown or Restructured. > I am talking about a true markup language. it appears you don't really know what you're talking about, as both markdown and restructured _are_ "true markup languages". you want to invent a new one. fine. go ahead. i did it myself... -bowerbird p.s. don, restructured text is older than markdown, and textile too, as far as i know... it's a reworking of "structured text", which is the granddaddy of all the light markup languages. and -- by the way -- z.m.l. is older than markdown. indeed, one of the few appearances of charlz on this listserve was when he came to announce markdown as a z.m.l. clone... -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sun Feb 21 13:00:32 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 21 Feb 2010 16:00:32 EST Subject: [gutvol-d] let us not be confused Message-ID: ok, we've got a couple of different topics running around, so let us take a minute to make sure we are not confused... first of all, let's talk about my campaign for preprocessing... i have demonstrated, over and over and over again, that d.p. (and rfrank) should be doing _much_ better preprocessing... i've shown how they can use _very_simple_means_ to do that, and how -- if they did -- they could reduce the error-counts in their books to a ridiculously small amount, even _before_ their text went in front of proofers. i have talked about how it is a huge _waste_ of the generous donations of volunteers (in both time and energy) not to do aggressive preprocessing, which automatically locates errors to make them easy to fix... again, the crux of my argument -- and i have proven it to be absolutely true, again and again -- is that it's _easy_ to do this. indeed, when i have shown the steps taken to locate the errors, it becomes painfully obvious how ridiculously simple they are... they include obvious checks, like a number embedded in a word, or a lowercase letter followed by a capital letter, or two commas in a row, or a period at the beginning of a line. _obvious_ stuff! this isn't rocket science. it's not even _hard_... it's dirt-simple! and yet neither d.p. nor rfrank has instituted such preprocessing. *** let's contrast this with gardner's request, which was to compile a list of reg-ex tests that will locate all possible errors in any random book. this request -- as worthy as it might seem -- is _much_ more difficult to realize. in fact, it's almost impossible. a friend of mine over in england, nick hodson, is a very prolific digitizer. all by himself, he has done some 500 books or more. nick collected an extensive set of checks over the years. i can't remember exactly how many there were, but roughly about 200. however, once nick upgraded his o.c.r. program, he found that about half of his checks were no longer required. they had been necessary essentially as an artifact of an outdated o.c.r. program. the type of books nick was digitizing hadn't changed, and neither had the quality of the scans, or the resolution of the scans, or the digital retouching that he performed on the scans -- none of that.
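Those "obvious checks" are easy to make concrete. A sketch of such a preprocessing report in Python; the patterns are illustrative stand-ins, not bowerbird's or nick's actual lists, and every hit is a suspect for a human proofer to review, not an automatic correction:

import re
import sys

CHECKS = [
    (r"[A-Za-z]\d|\d[A-Za-z]", "digit embedded in a word"),
    (r"[a-z][A-Z]",            "lowercase letter followed by a capital"),
    (r",\s*,",                 "two commas in a row"),
    (r"^\s*\.",                "period at the beginning of a line"),
    (r"\b\w+ n't\b",           "floating contraction (did n't, could n't)"),
]

def report(path):
    # Flag suspicious lines; some hits (e.g. "McGregor" tripping the
    # lowercase-capital check) will be false positives, which is why a
    # human makes the final call.
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            for pattern, label in CHECKS:
                if re.search(pattern, line):
                    print("%s:%d: %s: %s" % (path, lineno, label, line.rstrip()))

if __name__ == "__main__":
    report(sys.argv[1])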
he was the same person, using the same computer and scanner, and he was doing the same things exactly as he had done before. the only thing that changed was the version of his o.c.r. program. yet he found that many checks he formerly needed had become unnecessary. so, for an operation like d.p., who intakes all kinds of scans and uses a wide variety of o.c.r. programs, operated by users with a huge range of expertise, their results will be all over the board. they're _never_ gonna get a definitive list of checks to be made. it would be _immensely_ difficult, to the point of being impossible. but that's totally beside our other point, about preprocessing... because the fact of the matter is that a few dozen _simple_ tests are all that d.p. needs in order to reduce the number of errors to a level where they can be handled easily by their human proofers. they're never gonna get 100%. but they could find 90% so easily that it's criminal negligence that they aren't doing that already... heck, spell-check by itself will locate 50% of the errors for you... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From traverso at posso.dm.unipi.it Sun Feb 21 13:07:24 2010 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Sun, 21 Feb 2010 22:07:24 +0100 (CET) Subject: [gutvol-d] Re: [SPAM] Re: Mirroring the firehost? Re: Re: Many solo projects out there in gutvol-d land? In-Reply-To: (message from Greg Weeks on Sun, 21 Feb 2010 14:48:38 -0500 (EST)) References: <20100221193308.GE10824@pglaf.org> Message-ID: <20100221210724.0FA49FFB1@cardano.dm.unipi.it> >>>>> "Greg" == Greg Weeks writes: Greg> On Sun, 21 Feb 2010, Greg Newby wrote: >>> had the concatenated text available for download. It's behind >>> a sign on and not indexed by any of the search engines, so if >>> you don't know it's there already you can't find it. >> What's the URL? I could set up a nightly mirror... >> >> Do they automatically disappear from this area, after they are >> finally published? Greg> There's not a single place; you have to walk the project Greg> lists using the search function. They do eventually Greg> disappear, but the status changes to posted when they are Greg> posted to PG. Greg> Do you have a sign on at DP? If so try: Greg> http://www.pgdp.net/c/tools/project_manager/projectmgr.php?show=search&title=&author=&language[]=&special_day[]=&projectid=&project_manager=&checkedoutby=&pp_er=&ppv_er=&postednum=&state[]=P3.proj_waiting&n_results_per_page=100 Greg> That's everything in the P3 waiting queue. If you pick one Greg> from that list (I'm going to grab one of mine): Greg> http://www.pgdp.net/c/project.php?id=projectID4b5e3e5a9b845&detail_level=3 Greg> There's a link titled "Download Concatenated Text" with a Greg> download button that will download a zip with the text from Greg> the last proofing round. Greg> The two queues that are most interesting because they are Greg> the largest are the P3 waiting and F2 waiting. Greg> -- Greg Weeks http://durendal.org:8080/greg/ I have scripts that can download the concatenated texts without manual handling, and without a browser, but they are quite tricky, and I am not willing to discuss them in public, but will provide them to Greg Newby (as DP board member) if he wants. Just send me an email. Carlo From schultzk at uni-trier.de Mon Feb 22 02:01:36 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Mon, 22 Feb 2010 11:01:36 +0100 Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: References: Message-ID: <9752B13F-42A6-4469-8BEB-DD3ECEAC06A5@uni-trier.de> Am 21.02.2010 um 21:47 schrieb Bowerbird at aol.com: > keith said: > > I am not talking markdown or Restructured. > > I am talking about a true markup language. > > it appears you don't really know what you're talking about, as > both markdown and restructured _are_ "true markup languages". It would be futile to discuss what constitutes a markup language. > > you want to invent a new one. fine. go ahead. i did it myself... No, not a new markup language; an encoding or transcription, if you wish. Creating the "language" is not the problem; the problem is the code base for the tools and getting them used by a broad audience. I do not have the time to do so. DP has the infrastructure. But as for what they are missing: we have been there and back again. regards Keith. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Feb 22 11:22:28 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 22 Feb 2010 14:22:28 EST Subject: [gutvol-d] Re: so what is so important about pagination? Message-ID: <105c1.42bfe29.38b43374@aol.com> as i said earlier, there's no real need to provide "justification" for pagination. the fact that _some_ people want it is enough to make us decide that we shouldn't toss out that information. however, since nobody has mentioned the _best_ justification for including pagination information, i might as well tell you... here at the end of the paper-book half-millennium, we have roughly 10 million different books out there in the world... (this is according to my memory of recent figures, which may be off, perhaps even by a large amount, but that's immaterial.) if we figure there are an average of 1,000 copies of each book, that means we've got about 10 billion copies of paper-books... that's a lot of paper-books out there in the world. a whole lot. those paper copies are the _originals_, and they always will be. in the future -- even right now, thanks to google -- we have a virtually unlimited number of digital copies of those originals. but again, those digital versions will _always_ be "the copies"... and the paper-books will _always_ be "the originals"... forever. (even books that're "born digital" often become physical quickly, and that will continue into the far future with print-on-demand; and paper-books, due to their _physical_and_material_ nature, will always be the "real" books, while digital versions will always be the "copies", especially since they can be manipulated at will, while physical books have the virtue/liability of being "frozen".) "real" doesn't mean "more valuable" or "more important", it means _physical_ and _tangible_ and _visible_ and _made_out_of_atoms_. you really have to ground yourself in this thinking to understand -- _physical_ books are the "real" ones; digital books are "copies". that's our first important factor... and our second important factor is that e-books are manipulatable. and just as the frozen nature of p-books is both virtue and liability, so too is this manipulability. on the one hand, it's easy to fix errors, provide updates, and so on and so forth... but, on the other hand, it's also easy to alter the book in a way the author did not intend... and if you don't think people _will_ try to rewrite history, you're nuts. plus there's just sheer incompetence, which has already resulted in a number of very shoddy digitizations of books, full of inaccuracies.
just try and find all the copies of "pride and prejudice" out there, and then do a determination on which ones are "accurate" and which not. you will find this task to be overwhelming, and nearly impossible, and that's just one book out of our 10 million books. that is the problem. so there's little question that people in the future will be _skeptical_ about each and every e-book which they are handed, and rightly so... for reasons from accidental to quite intentional, it might be inaccurate. so we have a state where there are some "known" p-book "originals", and a ton of digital "copies" that might or might not be "trustworthy". (i believe jon noring has been absent from here for long enough that it's once again safe to use that word without all his derogatory spin.) now, there's only one solution to this state. any specific digital copy will have to be able to _prove_ its correspondence to a paper-copy... the easiest way to provide such proof is to assume the same form as the paper-copy; that is, it must adopt the linebreaks and pagination, so that each and every page can be subjected to visual confirmation... of course, in order to have value as a digital book, the file must be able to drop the linebreaks/pagination, and assume another form, one that reflows to the current set of desires of the end-user, _but_ it _must_ be able to mimic the look-and-feel of the paper-book too. if it cannot, it's simply going to be discarded as being untrustworthy. your e-book cannot afford to be nothing more than a formless blob. it _must_ be able to "snap to" a form that exactly imitates the p-book. and for it to be able to do that, you must keep linebreaks/pagination. it's really that simple. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From lee at novomail.net Mon Feb 22 11:28:21 2010 From: lee at novomail.net (Lee Passey) Date: Mon, 22 Feb 2010 12:28:21 -0700 Subject: [gutvol-d] Re: "The Inheritance" by Susan Edmondstone Ferrier In-Reply-To: <97FD5D5CD0E846AD94214B14886737BA@alp2400> References: <12985.490903c3.38adae70@aol.com> <97FD5D5CD0E846AD94214B14886737BA@alp2400> Message-ID: <4B82DAD5.801@novomail.net> On 2/17/2010 3:50 PM, Al Haines (shaw) wrote: > Your motivation? What do I care? Be altruistic, and do a book. But at the end of the day, if BB produces a book and gives it to PG (after, of course, posting a copy to the Internet Archive before it is degraded by the whitewashers) you have a book. But if he participates in Mr. Frank's "roundless" experiment, and you both encourage others to do so, at the end of the day you still have your book, probably faster than you would have gotten it otherwise (because I know quantity is important to PG), and perhaps even less error-prone than if a single individual had produced it (even though quality really isn't that important to PG), and BB and Mr. Frank have valuable data that perhaps can be used to develop a more efficient production system. Either way, you still get your book, but in the latter scenario valuable data is produced as well. Be altruistic, Mr. Haines, and support the experiment. From Bowerbird at aol.com Mon Feb 22 11:48:08 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 22 Feb 2010 14:48:08 EST Subject: [gutvol-d] hallelujah Message-ID: <11d7d.11b56de9.38b43978@aol.com> hallelujah! some people are finally talking some sense into rfrank's head. one person suggested some reg-ex tests be shown to proofers. 
all by itself, this is an improvement, but not that big, because these tests should be done _before_ the text goes to proofers. however, what it did was jolt roger out of his thinking that such tests are done in _postprocessing_, a huge ideological shift. roger admitted as much. thank you lord! another person pointed out that a global search-and-replace would be a real asset. d'uh, who's been saying that for _years_? one person said: > I've noticed "Pem" scanned as "Pern" a few times. roger responded with: > Done. Fifty-seven replacements. yes! see how easy this can be? so roger said, "tell me what kind of global changes you'd make". so one person came back and said, "how about things like these:" > change did n't to didn't > change could n't to couldn't kinda hard to believe that those fixes weren't already being made in preprocessing, isn't it? but hey, let's be glad for the progress... sure enough, roger got the hint, and changed all of the floating contractions, and then promised he'd do that in _preprocessing_ next time. hallelujah! now is the time to give roger that list of 30 tests that i highlighted back in that month-long series i did... now the _next_ thing would be for someone to _volunteer_ to do all of this preprocessing for roger, using the tool dkretz coded... hallelujah! -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From hart at pglaf.org Mon Feb 22 11:57:26 2010 From: hart at pglaf.org (Michael S. Hart) Date: Mon, 22 Feb 2010 11:57:26 -0800 (PST) Subject: [gutvol-d] !@!!@!!@!Re: Re: so what is so important about pagination? In-Reply-To: <105c1.42bfe29.38b43374@aol.com> References: <105c1.42bfe29.38b43374@aol.com> Message-ID: bowerbird says: your e-book cannot afford to be nothing more than a formless blob. it _must_ be able to "snap to" a form that exactly imitates the p-book. and for it to be able to do that, you must keep linebreaks/pagination. /// Making ebooks "a form that exactly imitates the p-book" is a KILLER!!! While he mentions the various eBooks of Jane Austen, he fails to talk about the wide variety of Jane Austen's p-books, and that paginations run rampant among them, not to mention margination, spelling, etc. THERE IS NO SUCH THING AS /ONE/ eBOOK THAT RULES THEM ALL. . . . As any of you who have followed this kind of conversation before know by now, I tried to find just TWO Declaration of Independence copies I could use to say they agreed with each other when I started the first entry in Project Gutenberg. While I do not doubt that somewhere I am likely to be able to FIND two, I did not find such a pair in my research of half a dozen copies at the time, nor even two copies that agreed the vast majority of the time on such issues. IT WAS A COLOSSAL WASTE OF TIME!!!!!!! When I think of going through much longer works. . .well, I do not!!! We went through all of this with Paradise Lost very early on, and the result was that we silenced our "pearls before swine" critics of some very highly placed Milton scholars, and it was fun doing so, but that was all there was to it, no real change for the average reader. I am not about to let one person, or journal, however scholarly, make the decision for Project Gutenberg as to what editions to use and how exactly to portray them in whatever format, margination, pagination-- or font, or color, or whatever. If so. . .we are nothing more than a Xerox machine. . . . We should, as always, create something "BETTER THAN THE ORIGINAL!!!"
Even if it means ruffling a few feathers. . . . My own dream is a single file, hardly larger than a plain text file, that contains all the editions VOLUNTEERS decide we should have. If the ivory tower is not willing to do that last percent or three-- to create their own "PERFECT" edition--let them whine like swine. From Bowerbird at aol.com Mon Feb 22 13:31:10 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 22 Feb 2010 16:31:10 EST Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about pagination? Message-ID: <15ba7.770b9009.38b4519e@aol.com> michael, i wish you would've taken the time to read what i actually wrote, instead of just giving your kneejerk reaction. because your response doesn't address the point that i made. i am loathe to get into this argument, because it won't mean a thing in the long run... the future will have its own issues, and it will need to deal with them, and i laid it all out clearly. but let me just address a few things, to provide some clarity. > he fails to talk about the wide variety of Jane Austen's > p-books, and that paginations run rampant among them, > not to mention margination, spelling, etc. there are different editions of many books, to be sure... and i count each edition as a separate book. your e-book will have to mimic _one_ of the editions in a faithful manner, or it will be discarded... notice that when i say "mimic", i do _not_ mean that it has to match it _exactly_. so, for instance, if you wanna close up spacey contractions, or correct spelling, or make other kinds of changes, they might (or might not) be totally acceptable to any one specific end-user in the future... but you _will_ have to make it easy for that specific end-user to _compare_ your e-book with a p-book, in order to spot changes. i've shown how this comparison is done, by mounting a web-page which has the text on one side of the screen, the scan on the other. but if your e-book is a formless blob, that's not gonna cut it... and, for the record, i'm most certainly _not_ recommending that we create some "scholarly" version of our books. i laugh at that. the _only_ thing we know about the scholars of the future is that we do _not_ know what they will want and it'd be foolish to guess. put yourself in the shoes of the future. you have a dozen different e-book files, all purporting to be copies of "sense and sensibility". you know that some of them have been doctored, and others have been bowdlerized, and you _hope_ that some of them are accurate. you can, with some work, find the differences between them, but you'd prefer not to have to go through that exercise if you could, because you'd have to then do more work to find the _right_ copy. so, how do you proceed? well, i can tell you that if _one_ of those copies made it _simple_ for you to verify its accuracy by assuming the form of the p-book, that will be your obvious first choice. think about it. you'll agree. so, michael, if you want to respond to this, answer that question. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From hart at pglaf.org Mon Feb 22 14:25:27 2010 From: hart at pglaf.org (Michael S. Hart) Date: Mon, 22 Feb 2010 14:25:27 -0800 (PST) Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about pagination? 
In-Reply-To: <15ba7.770b9009.38b4519e@aol.com> References: <15ba7.770b9009.38b4519e@aol.com> Message-ID: On Mon, 22 Feb 2010, Bowerbird at aol.com wrote: > michael, i wish you would've taken the time to read what i > actually wrote, instead of just giving your kneejerk reaction. And if you weren't so jerkknee you would have realized I had to have read the whole thing to get to the part I quoted. . .duh! When you ask people to pay attention, it helps to PAY ATTENTION. > because your response doesn't address the point that i made. It addresses EXACTLY the point you made that I quoted. . . . If that part contradicts your other points. . .sorry. . . . But having reread all of your comments, I don't see the change you say is there. . . . > i am loathe to get into this argument, because it won't mean > a thing in the long run... the future will have its own issues, > and it will need to deal with them, and i laid it all out clearly. > > but let me just address a few things, to provide some clarity. > > he fails to talk about the wide variety of Jane Austen's > > p-books, and that paginations run rampant among them, > > not to mention margination, spelling, etc. > > there are different editions of many books, to be sure... > > and i count each edition as a separate book. your e-book > will have to mimic _one_ of the editions in a faithful manner, Then SAY that!!! Right up front in plain language!!! However, that still relegates us to being a Xerox machine, no? > or it will be discarded... notice that when i say "mimic", i do > _not_ mean that it has to match it _exactly_. so, for instance, > if you wanna close up spacey contractions, or correct spelling, > or make other kinds of changes, they might (or might not) be > totally acceptable to any one specific end-user in the future... I'm never going to get into any of these semantic arguments!!!!!!! Mimic means to copy as closely as possible. . . . Synonym: copy. > but you _will_ have to make it easy for that specific end-user to > _compare_ your e-book with a p-book, in order to spot changes. As I have said before, if you would listen, I am not AGAINST keeping a copy with such pagination for such purposes, but I draw the lines, pun intended, at keeping every character in the same page position when there is no need for pages, in all available PG editions. I want our eBooks to be optimally readable: Minimal end of line hyphenation. No page headers or footers. Just plain reading. Once again, I have no stance AGAINST people who want pagination, I just don't want to force any such arbitrary formats on anyone and neither should you or anyone else. STOP TRYING TO FORCE YOUR OPINIONS ON OTHERS, MAKE THEM OPTIONS! > i've shown how this comparison is done, by mounting a web-page > which has the text on one side of the screen, the scan on the other. As I have always said, I have no objection to this in proofreading, just in real reading. . .but I am willing for it to be an OPTION!!! > but if your e-book is a formless blob, that's not gonna cut it... Tell that to the millions of people who prefer remargination to the specifications of their own systems. > and, for the record, i'm most certainly _not_ recommending that > we create some "scholarly" version of our books. i laugh at that. > the _only_ thing we know about the scholars of the future is that > we do _not_ know what they will want and it'd be foolish to guess. I CAN tell you that most of the paper editions' page numbers will fade along with the hyphenation.
> put yourself in the shoes of the future. you have a dozen different > e-book files, all purporting to be copies of "sense and sensibility". > you know that some of them have been doctored, and others have > been bowdlerized, and you _hope_ that some of them are accurate. Last time I looked there were still pretty ubiquitous programs to lay out all such differences. IFF you have such deep interests, you can simply put up two editions side by side when you look at them. . .I do. . . . If not, then you aren't really that interested. . .it's all smoke. > you can, with some work, find the differences between them, but > you'd prefer not to have to go through that exercise if you could, > because you'd have to then do more work to find the _right_ copy. "_RIGHT_" copy??? Now you've contradicted yourself back into the ivory tower. . . . "_RIGHT_" copy, indeeeeed. . . . > so, how do you proceed? > > well, i can tell you that if _one_ of those copies made it _simple_ > for you to verify its accuracy by assuming the form of the p-book, > that will be your obvious first choice. think about it. you'll agree. This will ONLY do you any good if you manage to find that edition, out of all the other paper editions in the world. > so, michael, if you want to respond to this, answer that question. Sorry, but I anticipated ALL of these questions when I first started, and have answered, and will continue to answer, at length. Why can't you just propose your ideas as OPTIONS, not CARVED IN STONE? Michael From lee at novomail.net Mon Feb 22 14:45:43 2010 From: lee at novomail.net (Lee Passey) Date: Mon, 22 Feb 2010 15:45:43 -0700 Subject: [gutvol-d] Re: so what is so important about pagination? In-Reply-To: <4B802F64.3040909@teksavvy.com> References: <1bbf9.4e21db05.38b0c5a5@aol.com> <4B802F64.3040909@teksavvy.com> Message-ID: <4B830917.5010602@novomail.net> On 2/20/2010 11:52 AM, Gardner Buchanan wrote: > My question to the pagination-preservers is: what is the > difference? Both hyphenation, line-endings and pagination > are mainly artefacts of the physical medium -- one of width > and the other of height. Bowerbird wants to keep both; > I see no need to keep either. But what is the reasoning > behind keeping one (pagination) and not the other? As with most things, your position depends on your perspective. As a reader (consumer) of e-books, I want to get /all/ the production artifacts out of my way; a line should wrap wherever I would expect it to depending on the size of my viewport (screen), hyphenation should only occur between syllables at the right edge of the viewport, and a page should end at the bottom of my viewport--no sooner, no later. Page numbers, if any, should reflect the number of /virtual/ pages there are in the book I'm reading; i.e. the number of viewports to complete the book. These page numbers should not be embedded in the text, but should be displayed somewhere else in the User Agent where I can refer to them if I want to, but otherwise they are inconspicuous. Of course, if I change fonts or the viewport size I would expect the page numbers to be updated to reflect that change. BUT ... As a producer of e-books, it is my self-appointed task to create an e-book whose reading experience matches, as nearly as possible, a specific instance of a historical paper book.
Clearly this doesn't mean that in the final product the page- or line-endings have to match the source, as that would in many cases lead to an awkward "ouija" board reading experience, but it does mean that I want to maintain markers throughout the e-document that can 1.) /create/ a view where the page- and line-endings match so I can do a side-by-side comparison of a page image with my electronic version, and 2.) lead me efficiently back to a particular page scan if there is any question about the correctness of the electronic edition. This apparent conflict between the two perspectives leads to two follow-on questions: 1.) where and how broad is the line between production and consumption?, and 2.) is it possible to create a single electronic document that can satisfy both needs? In the case of the PG/DP co-dependency, I think the line is clear and narrow: Distributed Proofreaders is /only/ a producer of electronic documents and its only consumer is Project Gutenberg. Project Gutenberg is /only/ a consumer of electronic texts, and while DP is its primary producer it is not the only one. According to Al Haines, one of PG's whitewashers, the PG 'errata' mechanism "is informal, at best, and there's no list of old submissions that would benefit from being re-done." Errata resolution at PG is handled via e-mail messages to a very small handful of whitewashers. According to Mr. Haines, "My PG priorities are my own productions first, followed by WWing, then Errata and Reposts." In yet another post, after detailing multiple problems with an old DP contribution he states: >> Is it worth it? Personally speaking, no. It's going to take hours to fix >> this text, time that I'd far rather spend on my own productions, but >> there's currently no mechanism except for the Whitewashers, a.k.a. >> Errata Team, to fix this kind of thing. (Probably simpler to just re-do >> this text from scratch, which is something *I'm* not about to do.) This is precisely the reason that DP puts such an emphasis on having a /completed/ text. Once an electronic document passes over from DP to PG there is almost /no/ chance that it will ever be improved, revised, or corrected. This is not to cast aspersions on the hard and dedicated work of the whitewashers, simply an acknowledgment of the fact that it is not a high priority for them and there is no formal mechanism to help it get done. Because Project Gutenberg is the /only/ consumer of Distributed Proofreaders' production, preservation of line- and page-breaks should be of little importance in the current DP->PG work flow. /If/, on the other hand, what you're doing lies outside of the DP->PG work flow (as it appears yours does), then the calculation changes. For example, what happens to the page scans from whence your text is derived? If those scans are not, and will not ever be, publicly available then encoding markers in the text that refer back to page scans and the original text layout may not be necessary or important. Likewise, if PG is your only distribution point then you will probably be the only one who will ever make changes, corrections or improvements to the text. If you expect that once you have completed a task and transferred responsibility to Project Gutenberg you are finished, perhaps even deleting the original scans and your intermediate work files (please don't do this; I'm sure that the Internet Archive would be willing to take them off your hands) then preservation of markers referring back to the original text is probably not necessary.
By contrast, if you are preparing files for broader distribution than simply via Project Gutenberg, or if you anticipate that someday a work flow may develop either inside or outside of the DP->PG chain that will support continuous improvement of your original work, then I would think that creating and preserving text markers, including original page-breaks, line-breaks, and page numbers referencing the original scan set would be advisable. This is particularly true as it is always easier to preserve data, even data of dubious value, than it is to try and recover data that has been lost or discarded. This leads us to the second question: is it possible to create a single electronic document which can satisfy the needs of both readers and producers? I believe that it is, but it requires the use of a markup language having at least the capability of marking some text as invisible, and a user agent that is capable of recognizing that markup and /not/ rendering it as indicated. I'm sure there are a number of markup languages that could satisfy this requirement, but I have chosen to use XHTML (with one small cheat). When ABBYY FineReader saves its OCR output in HTML format it has the option of placing a break (<br>) at the end of each line, and a horizontal rule (<hr>) between each page (an alternative is to save each scanned page as a separate file, but I find that less convenient). I then wrote a short program (could probably be done just as easily with a perl script, or even sed) that replaced each <br> with an anchor tag indicating the page number (<a class="lb" ...>), and replaced each <hr> with <pb>. Now <pb> is not a valid HTML element (hence the cheat), but I know of no user agent that will fail to render an HTML file just because it has an invalid element in it. FineReader is quite good at recognizing when line-ending hyphenation is due to splitting long words or when it is required as a part of a compound word. In the first case, when it saves using line breaks it saves the line-ending hyphenation either as a hard hyphen or a soft hyphen. When soft hyphens are replaced by &shy; (and the following white space is removed) you have recorded a line-ending hyphen which will not be displayed (although different user agents sometimes do things differently). Originally I was in the "collapse page- and line-break" camp, but because I never submit my texts to PG, and I have hope that someday some sort of continuous improvement process may evolve (and because maintaining the data is cheap and simple) I'm moving into the "preserve everything" camp. From azkar0 at gmail.com Mon Feb 22 15:02:41 2010 From: azkar0 at gmail.com (Scott Olson) Date: Mon, 22 Feb 2010 16:02:41 -0700 Subject: [gutvol-d] Re: so what is so important about pagination? In-Reply-To: <4B830917.5010602@novomail.net> References: <1bbf9.4e21db05.38b0c5a5@aol.com> <4B802F64.3040909@teksavvy.com> <4B830917.5010602@novomail.net> Message-ID: <2362473e1002221502k7a23d04et90c268e3fa3865a0@mail.gmail.com> On Mon, Feb 22, 2010 at 3:45 PM, Lee Passey wrote: > When ABBYY FineReader saves its OCR output in HTML format it has the option > of placing a break (<br>) at the end of each line, and a horizontal rule > (<hr>) between each page (an alternative is to save each scanned page as a > separate file, but I find that less convenient). I then wrote a short > program (could probably be done just as easily with a perl script, or even > sed) that replaced each <br> with an anchor tag indicating the page number > (<a class="lb" ...>), and replaced each <hr> with <pb>. Now <pb> is not a > valid HTML element (hence the cheat), but I know of no user agent that will > fail to render an HTML file just because it has an invalid element in it. Since the user agent will take care of rewrapping, you could just leave the linebreaks where they are. If you really want to have them encoded, I'd opt for some CSS: br.lb {display: none} in your <style> element.
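Lee's "short program" is not shown anywhere in the thread, so the following Python sketch is only a guess at its shape; the id scheme and the <a class="lb"> and <pb> marker names are assumptions carried over from the passage above:

import re

def add_markers(html, first_page=1):
    # Turn FineReader's per-line <br> and per-page <hr> into the kind of
    # invisible markers Lee describes.
    page, line, out = first_page, 0, []
    for chunk in re.split(r"(<br\s*/?>|<hr\s*/?>)", html):  # keep delimiters
        if chunk.startswith("<br"):
            line += 1
            out.append('<a class="lb" id="p%04dl%03d"></a>\n' % (page, line))
        elif chunk.startswith("<hr"):
            page += 1
            line = 0
            out.append("<pb>\n")  # the invalid-element "cheat"
        else:
            out.append(chunk)
    return "".join(out)

print(add_markers("first line<br>second line<br><hr>next page<br>"))

Scott's alternative achieves the same invisibility with less rewriting: keep the <br> elements, give them class="lb", and hide them with the CSS rule he quotes.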