From nwolcott2ster at gmail.com Tue Apr 1 08:55:59 2008 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Tue, 1 Apr 2008 10:55:59 -0500 Subject: [gutvol-d] stopping perpetuity -- harder than it looks! References: Message-ID: <001e01c89410$f33e67c0$660fa8c0@atlanticbb.net> Isn't this part of DP striving for perfection? And for what reason? PG started to make computer text files available to all. The first ones were typed in. Errors? Sure. But the product was perfectly useful. Page images did not exist. Now the necessity of maintaining end-of-page hyphens etc. escapes me. If you want English Literature from 1500 to 1950 you can subscribe to Chadwyck-Healey for a few thousand a year and have it all in page images, hyphens and all. If you don't want to pay you can go to the Internet Archive, and get all the page images you want, although the selection right now is not as good as PG's. With their new book scanner that works on wind, the drudgery is taken out of the scanning process, and it is only a matter of time until most of the UCAL library will be available. DP would be well advised to bundle up all their "perfect scans" and ship them to IA for preservation and forget about hyphens and ellipses. Except that the quality is probably too low at 200 dpi. This is a rant-and-rave site so I will rant and rave too! Love it! nwolcott2 at post.harvard.edu ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Tuesday, April 01, 2008 12:18 AM Subject: Re: [gutvol-d] stopping perpetuity -- harder than it looks! take a look at the project page for iteration#6 of "planet strappers": > http://www.pgdp.net/c/project.php?id=projectID47dfd4f82feae it's chugging along, and about half of the pages done have a "diff"... that's right, you heard me correctly, about _half_ the pages! :+) "but how can that be?", you might be asking. "already these pages went through 5 rounds, and there are _still_ changes being made?" yep.
sure are. not _corrections_, mind you. just "changes"... meaningless changes... every last one of them meaningless... most of them having to do with ellipses. and these changes appear to have been done by new proofers (who else would tackle a project that has been in the rounds a half-dozen times?) who don't know the rules. (for example they're replacing typos, ones where a note had been left.) heck, one (or more) is even putting spaces _between_ the ellipse dots! right after carlo, in a forum thread, said he had never seen that before. (but -- amazingly -- in strict accordance with the p.g. f.a.q. on ellipses, which has to be one of the most brain-dead p.g. rules devised thus far. spaces between the dots of an ellipses will wreak havoc on any rewrap.) throw in a couple of runarounds on end-line hyphenates as well, with some people inserting hyphens or asterisks, and others removing 'em, and you've got one tasty "error-injection" stew boiling in your pot... this is crazy. i mean, it's an excellent demonstration of what will happen when you have "rules" that are interpreted and reinterpreted differently all the time, and confusing to boot... there are currently _several_ threads running in the d.p. forums dealing with ellipse confusion: > http://www.pgdp.net/phpBB2/viewtopic.php?t=31237 > http://www.pgdp.net/phpBB2/viewtopic.php?t=30521 further, the proofers doing these changes don't seem to realize five rounds of proofers have checked these pages before them. ("and just think, every _one_ of them missed _all_ these ellipses... on one page after the next... really very surprising, that, isn't it?") _hours_ of proofer time were spent bringing you this conclusion. just on this iteration. so far. and it ain't done. but what the heck, it's just _proofer_time_... and that ain't worth as much as peanuts... *** oh, just in case you're wondering... this iteration#6 did _not_ catch the one remaining error, on p#33. we'll have to wait for iteration#7. -bowerbird p.s. 
however, iteration#6 _did_ find a p-book error that everyone thus far missed... the word "inconveniencies" for "inconveniences". what a shocker! how did everyone else manage to miss that so far? oh, ok, you big spoilsport, dictionary says either one is acceptable. but still, give that proofer a blue ribbon for great eyes trying hard! ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15&ncid=aolhom00030000000001) ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080401/286fb924/attachment.htm From hyphen at hyphenologist.co.uk Tue Apr 1 09:04:33 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Tue, 1 Apr 2008 17:04:33 +0100 Subject: [gutvol-d] stopping perpetuity -- harder than it looks! In-Reply-To: References: Message-ID: <001501c89412$15431860$3fc94920$@co.uk> Bowerbird at aol.com Wrote >take a look at the project page for iteration#6 of "planet strappers": >> http://www.pgdp.net/c/project.php?id=projectID47dfd4f82feae >it's chugging along, and about half of the pages done have a "diff"... >that's right, you heard me correctly, about _half_ the pages! :+) OK you have convinced me. I will avoid the pedants at DP and bash on doing things myself. Dave F -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080401/8f122751/attachment.htm From piggy at netronome.com Tue Apr 1 09:58:56 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Tue, 01 Apr 2008 12:58:56 -0400 Subject: [gutvol-d] stopping perpetuity -- harder than it looks! 
In-Reply-To: <001501c89412$15431860$3fc94920$@co.uk> References: <001501c89412$15431860$3fc94920$@co.uk> Message-ID: <47F269D0.9060303@netronome.com> Dave Fawthrop wrote: > > > > * * > > Bowerbird at aol.com Wrote > > >take a look at the project page for iteration#6 of "planet strappers": > >> http://www.pgdp.net/c/project.php?id=projectID47dfd4f82feae > > >it's chugging along, and about half of the pages done have a "diff"... > > >that's right, you heard me correctly, about _half_ the pages! :+) > > > > OK you have convinced me. > > > > I will avoid the pedants at DP and bash on doing things myself. > > > > Dave F > Please reconsider. Pedantry gets an undeserved bad rap :=). One point of this experiment is to expose this kind of noise-floor problem. This is an _experiment_, specifically designed to learn things about really large numbers of proofing rounds. Since starting this study of actual data, I've been hearing a lot less of the "perfectionist vs. good enough" debate. I certainly recall a certain fowl claiming that we just had to proof pages until there were no more changes. That was later revised to advice that we just need to proof until the number of changes was below some threshold, with no real suggestion on how to choose that threshold. I think we've got a consensus that not every book needs or deserves the same amount of work. We're developing the tools to apply the appropriate amount of effort to each book. I have certainly had my frustrations with PGDP, but I still credit the PGDP community with helping me learn a lot about preparing ebooks. There are a lot of people who are eager to help with whatever project you may have in mind. From nwolcott2ster at gmail.com Tue Apr 1 11:10:36 2008 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Tue, 1 Apr 2008 13:10:36 -0500 Subject: [gutvol-d] stopping perpetuity -- harder than it looks!
References: Message-ID: <004901c89423$c59c9a40$660fa8c0@atlanticbb.net> And for "inconveniencies" that would have been caught by MSWORD's spell-check!!! No 6 rounds required. nwolcott2 at post.harvard.edu From Bowerbird at aol.com Tue Apr 1 11:18:04 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 1 Apr 2008 14:18:04 EDT Subject: [gutvol-d] stopping perpetuity -- harder than it looks!
Message-ID: piggy said: > I certainly recall a certain fowl claiming that we just had to > proof pages until there were no more changes. oh please. don't be an idiot, ok? it's hard to respect idiots... you _do_ have to proof pages until there are no more changes. but you also have to have reasonable and consistent policies, so your proofers don't teeter-totter changes back and forth... and if you never heard me "claiming" that d.p. needs to have reasonable and consistent policies, you just weren't listening. > That was later revised to advice that we just need to proof > until the number of changes was below some threshold > and no real suggestion on how to choose that threshold. and here you've said _two_ stupid things, in one mere paragraph. first, i've _never_ "revised" my advice. it's still the same as it always was. you need to proof pages until there are no more changes. period. not "a small number of changes below some threshold", with that "threshold" being vague. until there's _no_ more changes. zero. further, obtaining _zero_ changes on a page in one proofing does _not_ mean that the page is perfect. it means there is a _chance_ that the page is perfect, with that chance being higher when you have a greater accuracy-rate from your proofers, lower when not. i've said you need _two_consecutive_ "no diff" proofings on a page in order to be _sufficiently_ confident that there are no errors left... at least that's what gives me a _comfortable_ level of _confidence_. if you're less particular, you might settle on one "no diff" proofing. and if you're more particular, you might require _three_ or _four_. *** i'd say you've confused yourself to the point that you can no longer even _recognize_ the simple answer i have given to you all along... but i'm confident the vast majority of people reading along got it. so please don't make yourself look stupid by misquoting it so badly. -bowerbird p.s.
as for running an "experiment" to "prove" that shoddy policies will lead to inconsistent results, i don't even think it merits the effort. From Catenacci at Ieee.Org Tue Apr 1 11:26:50 2008 From: Catenacci at Ieee.Org (Onorio Catenacci) Date: Tue, 1 Apr 2008 14:26:50 -0400 Subject: [gutvol-d] stopping perpetuity -- harder than it looks! In-Reply-To: <004901c89423$c59c9a40$660fa8c0@atlanticbb.net> References: <004901c89423$c59c9a40$660fa8c0@atlanticbb.net> Message-ID: On Tue, Apr 1, 2008 at 2:10 PM, Norm Wolcott wrote: > > > And for "inconveniencies" that would have been caught by MSWORD's > spell-check!!! No 6 rounds required > nwolcott2 at post.harvard.edu > Which version of Word? Word 2003 lets this go by as being spelled correctly. From BB's description this sounds like a grammatical error--not a spelling mistake. Like their vs. they're vs. there. -- Onorio Catenacci III From Bowerbird at aol.com Tue Apr 1 11:29:40 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 1 Apr 2008 14:29:40 EDT Subject: [gutvol-d] stopping perpetuity -- harder than it looks! Message-ID: norm said: > And for "inconveniencies" that would have been caught > by MSWORD's spell-check!!! No 6 rounds required well, don't forget that the p1 iterations don't have the full benefits of the d.p. spellchecker -- named "wordcheck" -- because the "good words" list was not maintained for them. that means that words which are _common_ in the book but _not_ in the dictionary -- like the names of the characters -- are flagged _repeatedly_, which is so distracting that i'd guess most of these p1 iteration proofers didn't even bother to use it.
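a rough sketch of the wordcheck idea -- flag only words found in neither the dictionary nor the project's "good words" list. the word lists here are invented for illustration, not d.p.'s actual data:

```python
import re

def wordcheck(text, dictionary, good_words):
    """Flag words found in neither the dictionary nor the
    project's good-words list (names, period spellings, etc.)."""
    words = re.findall(r"[A-Za-z']+", text)
    known = dictionary | good_words
    return sorted({w for w in words if w.lower() not in known})

# invented data: a tiny dictionary plus a good-words list
# carrying one of the book's (hypothetical) character names
dictionary = {"the", "captain", "said", "nothing", "to", "or"}
good_words = {"kubelsky"}

flagged = wordcheck("The captain said nothing to Kubelsky or Vance",
                    dictionary, good_words)
print(flagged)  # ['Vance'] -- only the truly unknown word is flagged
```

with a maintained good-words list, character names stop being flagged on every page; without one, they drown out the real scannos.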
but yeah, you'd think that the p3 proofers, who are _required_ by d.p. rules to use wordcheck, would have caught that error... however, (1) it _was_ spelled that way in the p-book, and there is no requirement to fix p-book errors, and (2) the dictionary says that either form is acceptable. so really, it is a non-issue... i just thought it was interesting that a sharp-eyed proofer who was faced with clean page after clean page _still_ caught that... when otherwise they might be forgiven for falling asleep. and also, it puts the lie, once again, to the notion that there are "expert proofers" who can catch errors that "normal" ones can't. -bowerbird From Bowerbird at aol.com Tue Apr 1 11:39:46 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 1 Apr 2008 14:39:46 EDT Subject: [gutvol-d] stopping perpetuity -- harder than it looks! Message-ID: onorio said: > From BB's description this sounds like a grammatical error > --not a spelling mistake. no, it was neither a grammatical error nor a spelling mistake. although "inconveniencies" is flagged by my spellchecker too, my p-dictionary (random house webster's college dictionary) gives it as a legitimate spelling, defined as "inconveniences"... when i said the p3 proofers required to use wordcheck _should_ have caught this, it's because i assumed the word was flagged... but if it wasn't flagged, there is no reason they should catch it, since -- as i've said before -- it _was_ that way in the p-book... but yes, this is the kind of thing that will hang you up badly when you strive for perfection, if you don't keep it in check...
it's funny what _can_ show up in the _8th_ round of proofing. (6 rounds by p1 proofers, and 1 each by p2 and p3 proofers.) -bowerbird From hart at pglaf.org Tue Apr 1 12:15:54 2008 From: hart at pglaf.org (Michael Hart) Date: Tue, 1 Apr 2008 12:15:54 -0700 (PDT) Subject: [gutvol-d] stopping perpetuity -- harder than it looks! In-Reply-To: References: Message-ID: I've also seen errors that crept in AFTER all major proofreading, such as errors in headers, footers, footnotes, introductions, and all that other stuff. In the end, the final proofings should include one by someone who was NOT involved in ANY of the previous proofreadings. . . . Someone with a fresh eye for the errors mentioned above. . . . PS. Has anyone heard from Jon Noring? This is the first year of our "March Madness" when he was not one of the major players for at least five years. I sent a note to the most recent posting he sent here, but didn't get any response, and it's been at least a week. Please advise, Michael S. Hart Founder Project Gutenberg On Tue, 1 Apr 2008, Bowerbird at aol.com wrote: > onorio said: >> From BB's description this sounds like a grammatical error >> --not a spelling mistake. > > no, it was neither a grammatical error nor a spelling mistake. > > although "inconveniencies" is flagged by my spellchecker too, > my p-dictionary (random house webster's college dictionary) > gives it as a legitimate spelling, defined as "inconveniences"... > > when i said the p3 proofers required to use wordcheck _should_ > have caught this, it's because i assumed the word was flagged...
> but if it wasn't flagged, there is no reason they should catch it, > since -- as i've said before -- it _was_ that way in the p-book... > > but yes, this is the kind of thing that will hang you up badly > when you strive for perfection, if you don't keep it in check... > > it's funny what _can_ show up in the _8th_ round of proofing. > (6 rounds by p1 proofers, and 1 each by p2 and p3 proofers.) > > -bowerbird > > > > ************** > Create a Home Theater Like the Pros. Watch the video on AOL > Home. > (http://home.aol.com/diy/home-improvement-eric-stromer?video=15& > ncid=aolhom00030000000001) > From Bowerbird at aol.com Tue Apr 1 12:22:41 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 1 Apr 2008 15:22:41 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 01 Message-ID: so let's go to work on the _second_ parallel experiment... this time, we'll cut right to the interesting stuff. in this book, with 7,654 lines, and the same shoddy scans, and the same miserable tesseract o.c.r., there were only a mere _378_ of the p1 lines changed by p2 and/or p3... > http://z-m-l.com/go/chris/chris-378-changes.html as before, p1 made _thousands_ of changes to the text, taking the entire book to such a state of perfection that just 378 lines need to be changed later by p2 and/or p3. this part is incontestable. just compare p1 and p3 text. *** further... i can tell just by looking at these lines that the results will be _substantially_similar_ to the first book in all regards... that is, even of that small number of 378 lines changed, _most_ of the changes (1) were due to the bad scans, or (2) were due to bad o.c.r., or (3) were due to d.p. policies which need to be improved, or (4) could've been fixed by reasonable use of an average clean-up tool, not humans. so once we realize the few _actual_ flaws in the p1 output, we'll be _absolutely_amazed_ by their high level of quality. the p1 proofers rock... and successive p1 iterations roll... 
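the "just compare p1 and p3 text" step can be sketched in a few lines of python. the sample lines below are invented, and real round output would need an alignment pass (e.g. difflib) wherever lines were inserted or dropped:

```python
def changed_lines(p1_lines, p3_lines):
    """Collect the line-aligned differences between two rounds."""
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(p1_lines, p3_lines), start=1)
            if a != b]

# invented sample: two lines of round output, one scanno fixed
p1 = ["It was a dark arid stormy night.", "The clock struck twelve."]
p3 = ["It was a dark and stormy night.", "The clock struck twelve."]

for lineno, before, after in changed_lines(p1, p3):
    print(f"line {lineno}: {before!r} -> {after!r}")
# only 1 of the 2 lines was changed between rounds
```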
*** some people might see no need to repeat this exercise... considering the d.p. power of complete and total _denial_, however, it's probably necessary to do the drill once again. because if i don't, they'll keep allowing half-assed scanning, and content providers who use _tesseract_ to do their o.c.r. and fail to do any preprocessing on the books they submit... and _somebody_ has to document this massive, unethical waste of volunteer time and energy, or it will just continue. -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15& ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080401/c50e6b47/attachment.htm From Bowerbird at aol.com Tue Apr 1 12:53:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 1 Apr 2008 15:53:03 EDT Subject: [gutvol-d] stopping perpetuity -- harder than it looks! Message-ID: michael- jon noring is fine, as far as i can tell. and i, for one, haven't missed him here... :+) i did notice his absence (what a relief!), but didn't miss him, know what i mean?... ;+) but i've seen him elsewhere... he's helping a friend self-publish a book, and has made a couple posts to a listserve asking questions about that whole process. evidently, that has made him think a bit, and _re-think_ his previous commitment to have major publishers endorse his various efforts... he's recognized p.o.d. will change everything. so he recently posted a message to his listserve arguing that we'll "publish first and filter later" in the future, and publishers will have to adapt from "gatekeepers" into "marketeers" instead... there was also an entry on teleread to that effect. which reminds me... for anybody _waiting_ for my spring equinox "state of teleread" message... :+) sorry for the delay, but there's not much to report. 
i haven't been reading teleread much, not even the comments, which i assume are still mildly interesting. not enough time... plus, i've been reading david moynihan instead... > http://www.munseys.com/technosnarl/ i find david m. much more entertaining than david r. plus he's brief. and doesn't repeat himself so much. oh yeah, he makes more sense, always a good thing. also, moynihan calls rothman an "idiot" straight out, a straightforwardness i find delightful. more importantly, though, moynihan (who has always been the person examining the e-book marketplace more closely than anyone, so he knows the numbers) is convinced that amazon has changed the game with the kindle, seizing _half_the_market_ in just 3 months -- according to the best data that he can ferret out -- which is even more remarkable considering that they haven't even been able to keep up with the demand for the hardware units. when they can, it's all over... moynihan takes this as a stake in the heart of .epub. since i trust moynihan's knowledge of the numbers, rothman's spin machine seems _irrelevant_ anymore, so i won't even bother to set his confusions straight... -bowerbird From Bowerbird at aol.com Wed Apr 2 02:22:02 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Apr 2008 05:22:02 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 02 Message-ID: we're on book 2 of the parallel proofing experiment at d.p. -- christopher and the clockmakers -- so let's dive right in... first, we showed that p2 and p3 only changed _378_ lines.
now let's see _how_many_ of those 378 lines we could have fixed automatically out of p1 using a decent clean-up tool... i'm talking maybe a dozen reg-ex global changes, all of 'em well-known to d.p. already, contained in gutcheck or guiguts or guiprep or roger frank's tools or in dkretz's reg-ex list... there's no sleight of hand here, just plain old elbow-grease... so how many are fixed? wow. 239 of them. not bad. that leaves just _139_ lines short of perfection straight from p1, using preprocessing. here are those 139: > http://z-m-l.com/go/chris/chris-139-changes.html remember, this book has some 7,654 non-blank lines... so even though the o.c.r. was totally crappy, because of some dicey scans and the use of a beta o.c.r. application, causing poor p1 _thousands_of_unnecessary_corrections_, p1-plus-auto-cleanup took over 7,500 lines to perfection... and we haven't even parceled out the effects of the o.c.r. and looking at the large number of "bad" lines that were caused by missing em-dashes, which was a hallmark of the tesseract problems in the last book, i think the total of 139 lines will drop substantially after we count that... plus the plain old scannos that abbyy wouldn't have made, and the words that got cut off by that "gutter-noise", and... once again, p1 rocks. it's getting kinda boring, isn't it? -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15& ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080402/8628e927/attachment.htm From Bowerbird at aol.com Wed Apr 2 09:59:48 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Apr 2008 12:59:48 EDT Subject: [gutvol-d] from the book of matthew Message-ID: um, piggy? you know how i said you didn't have to use my name or give me credit on the d.p. 
wiki, because "facts is facts", remember that? well, if you _are_ going to use my name, then please _do_ make sure that you get all the facts straight, ok? you have me saying that, on the "planet strappers" test, p1 proofers made some 200+ corrections of "real" errors. that's not really what i said. and it's not really what happened. no, the p1 proofers had to make some _2,200_ corrections. and when i said that only _two_hundred_ were "real" errors, that was to contrast them with the _two_thousand_ errors that _you_ "injected" into the text, with your incompetence. you seem to want to "forget" that you injected those errors, but my entire point was to _highlight_ the severe contrast between the few _o.c.r._ errors versus your many _incompetence_ errors. if you're going to use the facts i uncovered, please use them all. your selective attention shines the spotlight in the wrong place. from the book of matthew: > 7:3 And why beholdest thou the mote that is in thy brother's eye, > but considerest not the beam that is in thine own eye? > 7:4 Or how wilt thou say to thy brother, Let me pull out the mote > out of thine eye; and, behold, a beam is in thine own eye? > 7:5 Thou hypocrite, first cast out the beam out of thine own eye; > and then shalt thou see clearly to cast out the mote out of > thy brother's eye. first cast the beam out of thine own eye... -bowerbird ************** Planning your summer road trip? Check out AOL Travel Guides. (http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080402/dd21f9e5/attachment.htm From Bowerbird at aol.com Wed Apr 2 11:15:29 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Apr 2008 14:15:29 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 03 Message-ID: sometimes, when you just look at all the lists of errors, it's easy to lose perspective on the awesome job p1 did. so take a gander at this file: > http://z-m-l.com/go/chris/chris-139-fullcheck.html the lines which o.c.r. and p1 got perfect are listed in black, while lines corrected by p2 and/or p3 are shown in color... the paucity of colored lines shows how well p1 proofed this. remember how much of this o.c.r. was absolutely atrocious... or, in case you haven't seen it yet, here's that o.c.r.: > http://z-m-l.com/go/chris/chris-ocr.txt p1 took that garbage and turned it into a near-perfect book. *** also... some of you might wonder how you'd do a "reconciliation" between two different rounds. so i demonstrated it here... you'll notice that each line which was different in the rounds has a checkbox at the end. so a person could step through these differences, checkboxing the version which is correct. (repeat a search for "diff>" to bring each difference in view.) of course, in this, the p3 version is gonna be the right one -- at least we'd assume so! -- so this is unnecessary here... but in _parallel_ rounds, rather than serial, you have to do a reconciliation of the differences, and this is how you do that. as you can see, in the overwhelming majority of the cases, you can resolve the discrepancy without viewing the scan. (but an ability to summon the scan will be included later, so when it _is_ necessary to view it, you can do it easily...) -bowerbird p.s. the labeling was incorrect on this file posted earlier: > http://z-m-l.com/go/chris/chris-139-changes.html i've changed it from the incorrect tess/abby to the correct p1/p3. 
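a rough sketch of that reconciliation step, using python's difflib -- the chooser function here is a stand-in for the checkbox interface, and the sample lines are invented:

```python
import difflib

def reconcile(version_a, version_b, choose):
    """Merge two parallel proofings: keep agreed lines as-is and
    call choose(a_lines, b_lines) only where the versions disagree."""
    merged = []
    sm = difflib.SequenceMatcher(a=version_a, b=version_b)
    for op, a0, a1, b0, b1 in sm.get_opcodes():
        if op == "equal":
            merged.extend(version_a[a0:a1])
        else:
            merged.append(choose(version_a[a0:a1], version_b[b0:b1]))
    return merged

a = ["He paused . . .", "and went on."]
b = ["He paused...", "and went on."]

# hypothetical policy: always prefer the second proofer's reading
result = reconcile(a, b, lambda av, bv: bv[0] if bv else av[0])
print(result)  # ['He paused...', 'and went on.']
```

in a real interface the chooser would show the human both readings (and the scan, when needed) instead of applying a fixed policy.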
************** Planning your summer road trip? Check out AOL Travel Guides. (http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080402/4a5d1f6b/attachment-0001.htm From Bowerbird at aol.com Wed Apr 2 15:20:30 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Apr 2008 18:20:30 EDT Subject: [gutvol-d] the stakes Message-ID: here's a message for d.p., especially its "leadership"... i hope that i've made clear that i believe a moral issue is involved here, namely your wasting of the time and energy of your volunteer digitizers. as i've said, i will take this moral issue to the public, if that's necessary... thus far, i've tried to talk to you on your own forums, but i was banned. and i've tried to talk to you here, but you've stuck your head in the sand. this leaves me no choice but to go wide. i really don't think that's what you want me to do, though... as i've shown, i can muster _solid_evidence_ that is _very_ damaging. do you really want the public at large to be staring at such evidence? imagine the attention that will be laser-focused on your inefficiency and incompetence if a slashdot or a boingboing gets a hold of this... the time is now, for you to act, or for you to regret not acting... the choice is up to you. i'll give you two weeks to think about it, and get back to me... that's april 15th, so i don't think you'll have a problem remembering it. if you don't have a solid plan in place to make immediate improvements, i will put into place a solid plan to bring public scrutiny to bear on you... in a _big_ way. i don't _want_ to do that. but i will, if i have to, to protect the proofers. these are the stakes. make no mistake, the stakes are high. the decision is in your hands. do the right thing. because if you won't, i will... that is all. 
-bowerbird ************** Planning your summer road trip? Check out AOL Travel Guides. (http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080402/66c793f8/attachment.htm From Bowerbird at aol.com Thu Apr 3 00:26:48 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 3 Apr 2008 03:26:48 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 04 Message-ID: ok, here's an easy one. i put the p3 output from "christopher and the clockmakers" into z.m.l. format, to be able to work it with more fluency... > http://z-m-l.com/go/chris/chris.zml *** notice that z.m.l. is similar to the o.c.r. output structure, except instead of o.c.r.-type "------file: 123.png-------" separators, z.m.l. has a pair of lines, the top one ending the previous page with a double-bracketing of the _pagenumber_ on that page, and the bottom line in the pair beginning the next page with a double-braced explication of the filename of the page's image. *** of course, once it was in z.m.l. format, i could auto-generate the .html files to administer "continuous proofreading", thus: > http://z-m-l.com/go/chris/chrisc001.html or, if you want to jump to my standard page, page 123: > http://z-m-l.com/go/chris/chrisp123.htm and i remind you of the convenient 2-up facing-pages mode: > http://z-m-l.com/go/chris/chrisp123w.htm as opposed to the strait-jacket of one-page-at-a-time, a la the d.p. interface, here the book is at your disposal. you can skim forward and back, or jump from chapter to chapter, or dial a particular page from the "contents" file. (or by typing its pagenumber in the browser address-bar.) -bowerbird ************** Planning your summer road trip? Check out AOL Travel Guides. 
From legutierr at hotmail.com Fri Apr 4 09:08:25 2008
From: legutierr at hotmail.com (L Gutierr)
Date: Fri, 4 Apr 2008 12:08:25 -0400
Subject: [gutvol-d] PG ebooks on OLPC XO
Message-ID:

Hi! I am planning to write a Python program to read PG ebooks, for eventual use on the OLPC XO laptop and on Macintosh computers (two platforms that support Python natively for GUI programs).

What I want the program to do is read PG ebooks in ASCII form, interpret them semantically, and then display the semantic elements according to either the user's preferences or the requirements of the platform. What I mean by semantic interpretation is that the program would read the book and know what the parts are conceptually (chapter, author, title, description, illustration, footnote), not just what they might look like in the original, visual-appearance-wise. This will allow the program to include GUI elements that are generally not available to web browsers and other text readers, including most HTML representations: for example, it would allow margin notes to appear at the margin, footnotes to appear at the bottom of the page that the user sees (regardless of font size), and things like that.

Anyone want to help me do this, or have suggestions? The project would be open source (probably a BSD-type license or public domain, not sure yet).

The first step in this project for me will be to define a grammar that can be used to build a parser (I don't want to do this heuristically). I'm thinking that a good starting point will be to create a formal grammar, if one does not already exist, that describes the formatting standards on the pgdp.net website (http://www.pgdp.net/c/faq/document.php).
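As a rough illustration of the semantic pass described above — and only an illustration, since the stated plan is a formal grammar rather than heuristics — a first cut might pull the standard "Title:"/"Author:" header lines and split the body into chapters. The dictionary shape and the regex patterns here are my own assumptions:

```python
import re

def parse_pg_text(text):
    """Naive semantic pass over a PG plain-text ebook: pull the
    Title/Author lines from the header and split the body into
    chapters.  A real reader would be driven by a formal grammar;
    this regex sketch only shows the shape of the output."""
    book = {"title": None, "author": None, "chapters": []}
    m = re.search(r"^Title:\s*(.+)$", text, re.MULTILINE)
    if m:
        book["title"] = m.group(1).strip()
    m = re.search(r"^Author:\s*(.+)$", text, re.MULTILINE)
    if m:
        book["author"] = m.group(1).strip()
    # Split the body on CHAPTER headings; the capturing group keeps
    # each heading so it can be paired with the text that follows it.
    parts = re.split(r"(?m)^(CHAPTER [IVXLC\d]+.*)$", text)
    for i in range(1, len(parts) - 1, 2):
        book["chapters"].append(
            {"heading": parts[i].strip(), "text": parts[i + 1].strip()}
        )
    return book
```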
From reading the pgdp.net website, it seems that a good number of PG books are being submitted now with that format. Does anyone know if other well-defined formatting standards exist within the PG library? Does anyone want to guess about the percentage of PG ebooks that might validly comply with some kind of formal grammar that could be precisely defined (I'm thinking that several grammars could be used if necessary)? I'm fully aware that this formatting issue has come up before in these mailing lists, and that, as a whole, the PG library is a bit of a mess formatting-wise, but I want to give it a shot anyway. Any thoughts out there? Anyone want to take the plunge with me? -legutierr _________________________________________________________________ Going green? See the top 12 foods to eat organic. http://green.msn.com/galleries/photos/photos.aspx?gid=164&ocid=T003MSN51N1653A From Bowerbird at aol.com Fri Apr 4 12:38:56 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Apr 2008 15:38:56 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 05 Message-ID: ok, here's another easy one. in "christopher and the clockmakers", used for the second parallel proofing experiment at d.p., p1 proofers changed 3,724 lines from the o.c.r. > http://z-m-l.com/go/chris/chris-3724changes-p1.html three thousand seven hundred and twenty four! my word! contrast that with the _378_ lines changed in p2 and/or p3: > http://z-m-l.com/go/chris/chris-378-changes.html the p1 proofers work _extremely_ hard, and do a darn fine job. in this book, like in a lot of books, they took very crappy input, and turned it into output that was nearly perfect. right on, p1! *** so once again, we get the pattern i've discussed all along, the pattern that seems to capture a "common-sense" take, which is that p1 fixes most of the errors, p2 gets most of the remaining ones, and p3 comes in and does clean-up. 
again, this is the pattern you get on page after page, in book after book, day after day, over in d.p.-land...

why there is _any_ lack of awareness or comprehension of this pattern is a total and complete mystery to me...

-bowerbird

From nwolcott2ster at gmail.com Fri Apr 4 16:45:35 2008
From: nwolcott2ster at gmail.com (Norm Wolcott)
Date: Fri, 4 Apr 2008 18:45:35 -0500
Subject: [gutvol-d] Most downloaded book
Message-ID: <000a01c896ae$053b2fe0$650fa8c0@atlanticbb.net>

Interesting that the most downloaded book from the Internet Archive (which includes PG) is Dudeney's Amusements in Mathematics, 1890. It is not often that a Victorian book hits the big time. The only reason I can think of is that a lot of high school kids are having fun trying to solve the mathematical puzzles in the book. And they are not trivial. Of course the downloads could be coming from overseas too; maybe there is more math interest there. This book has been consistently at the top of the charts since it appeared a few months ago.

nwolcott2 at post.harvard.edu

From hart at pglaf.org Fri Apr 4 20:36:38 2008
From: hart at pglaf.org (Michael Hart)
Date: Fri, 4 Apr 2008 20:36:38 -0700 (PDT)
Subject: [gutvol-d] Most downloaded book
In-Reply-To: <000a01c896ae$053b2fe0$650fa8c0@atlanticbb.net>
References: <000a01c896ae$053b2fe0$650fa8c0@atlanticbb.net>
Message-ID:

As far as I can tell, most of our downloads ARE from the other side of the northern hemisphere, not U.S. I wonder if we can narrow down the epicenter???
Michael On Fri, 4 Apr 2008, Norm Wolcott wrote: > Interesting the most downloaded book from the Internet Archive (which includes PG) is Dudeny's Amusements in Mathematics, 1890. Not often a Victorian book hits the big time. The only reason I can think of is that a lot of high school kids are having fun trying to solve the mathematical puzzles in the book. And they are not trivial. Of course the downloads could be coming from overseas too, maybe more math interest there. This book has been consistently at the top of the charts since it appeared a few months ago. > nwolcott2 at post.harvard.edu From ralf at ark.in-berlin.de Sat Apr 5 03:28:50 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Sat, 5 Apr 2008 12:28:50 +0200 Subject: [gutvol-d] PG ebooks on OLPC XO In-Reply-To: References: Message-ID: <20080405102850.GB6387@ark.in-berlin.de> > Does anyone know if other well-defined formatting standards exist within the PG library? Does anyone want to guess about the percentage of PG ebooks that might validly comply with some kind of formal grammar that could be precisely defined (I'm thinking that several grammars could be used if necessary)? The easiest way to find that out is to go to Advanced Search, and look at the available file formats. For your purpose, note especially La/TeX and TEI. Oh, and if you try to search for them, give a space character as author, else it won't work. You can ask me about TEI issues. For example, we have now several scholarly works with their references encoded such that a bibliography can be extracted and converted to any format. Regards, ralf From Bowerbird at aol.com Mon Apr 7 02:45:51 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 7 Apr 2008 05:45:51 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 06 Message-ID: ok, lately we've been examining fairly closely "christopher and the clockmakers", used for the second d.p. parallel proofing experiment. 
this book contains roughly 8,000 lines... we found that the p1 proofers had to modify 3,724 of them, which is quite some work... i'll give an overview of the changes in this post; my next message will provide the justifying data.

tesseract is a bad o.c.r. app. when i used abbyy, nearly 800 more lines were recognized correctly.

many of the scans were bad, but 14 were awful. these 14 scans alone contributed over 95 errors. i'd guess "partly" bad scans added another 100+.

so we're up to about 1,000 changed lines being directly attributable to an inept content provider -- via the execution of the scans, and the choice of o.c.r. -- rather than "inevitable" results of the o.c.r. process.

1,000. that's over 1/4 of the 3,724 changes made by p1.

further, rejoining 950 end-of-line hyphenates caused about _1,900_ additional lines to have to be changed. ellipses -- 100 of them -- likely caused more changes, as did a wide assortment of other policy-related factors.

so right there we've got roughly _2,000_ changes that were unnecessarily caused by idiotic d.p. policies that force proofers to do work computers could do better...

2,000. which is over 1/2 of the 3,724 changes made by p1.

altogether, 3,000 of the original 3,724 changed lines can be attributed to _avoidable_ aspects of practice and policy. in other words, 80% of the changes didn't need to be made. the energy spent making them was wasted, plain and simple. wasting 80% of the resources donated to you is _unforgivable_...

and this figure is in line with the other books i've analyzed; even if other books don't have the same problems as this one, many have their own problems causing unnecessary work. it is the rare d.p. project which doesn't reflect _some_ flaw.

d.p. needs to _examine_ its workflow, and _improve_ it...

-bowerbird
From Catenacci at Ieee.Org Mon Apr 7 10:39:03 2008
From: Catenacci at Ieee.Org (Onorio Catenacci)
Date: Mon, 7 Apr 2008 13:39:03 -0400
Subject: [gutvol-d] Copyright Status of Sales Literature
Message-ID:

Hi all,

I have a booklet that may qualify as sales literature. I would normally think that any booklet or publication that lacks a copyright notice and was published before 1978 would fall into the public domain (Rule 5), but Rule 3 also seems as if it may apply in this case. The problem is that this particular piece of sales literature has no detectable publication date or copyright notice. Text in the booklet itself would seem to indicate that it was published sometime later than 1971, but beyond that I cannot find a definite date. Assuming I could prove publication before 1978, would Rule 5 apply even if it is sales literature?

Any suggestions about tracking down the copyright status of this booklet would be welcome. I tried contacting the company that originally published the booklet, figuring that their explicit permission would remove all doubt, but I cannot seem to get hold of the right legal people within their organization to answer this question for me.

-- Onorio Catenacci III

From Catenacci at Ieee.Org Mon Apr 7 10:41:01 2008
From: Catenacci at Ieee.Org (Onorio Catenacci)
Date: Mon, 7 Apr 2008 13:41:01 -0400
Subject: [gutvol-d] Copyright Status of Sales Literature
Message-ID:

Hi all,

I'm sorry--my last e-mail should have been preceded by the statement "If anyone can help me with this, I would greatly appreciate it". I apologize if my first e-mail sounded a bit demanding.
:-) -- Onorio Catenacci III From Bowerbird at aol.com Mon Apr 7 11:14:11 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 7 Apr 2008 14:14:11 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 07 Message-ID: hope you had a good weekend. now back to work... in "christopher and the clockmakers", used for the second parallel proofing experiment at d.p., p1 proofers changed 3,724 lines from the o.c.r.: > http://z-m-l.com/go/chris/chris-3724changes-p1.html that's certainly a lot of changes. p1 did much work. however, let's break that number down a little, ok, and find out the _reasons_ for all of those changes. ok, first of all, we know that tesseract is a dog... when i did the o.c.r. using abbyy finereader instead, there were only 2,944 lines different from p1 output: > http://z-m-l.com/go/chris/chris-2944abbyp1changes.html this means that abby would have _saved_ p1 from having to make some _800_ of their changes. wow. (it's actually much more than that, because we have assumed here that the p1 output was correct... but we'll live with this underestimate for the time being.) next, i sleuthed which page-scans were bad -- 14: > http://z-m-l.com/go/chris/chris-badpages.html i manually corrected the o.c.r. from those pages, and -- voila! -- a total of 95 lines were changed as a result. > http://z-m-l.com/go/chris/chris-badpages95changes.html note these 14 pages were just the _really_awful_ scans. i suspect there are at least another 100+ lines that were made incorrect because of a _partly_ bad scan of a page. so bad scans and a bad o.c.r. program injected over _1,000_ errors into this book. totally unnecessarily. that's the damage incompetent content providers cause, which forces _volunteer_ proofers to do needless work... i understand why piggy wants to "refocus" responsibility. his actions injected 1,000+ errors into a book for which the number of _o.c.r._errors_ was less than a thousand... the first rule of medicine is "do no harm"... 
*** now let's look at how many of the remaining changes were due solely and completely to stupid d.p. policies, yet another place where unnecessary work is caused... first, i auto-joined end-of-line hyphenates in the o.c.r. it's worth noting -- when using a line-based analysis -- that a rejoined hyphenate unnecessarily causes _two_ meaningless differences to pop up, one on the top line, and another on the bottom line. by doing the rejoining _automatically_, on the o.c.r. text, before it goes to p1, i eliminated those needless differences from appearing. (and if d.p. had done it, it would've saved proofers work.) this book had 950 end-of-line hyphenates to be rejoined: > http://z-m-l.com/go/chris/chris-950eolhyphenates.html what that means is that _1,900_ of the changed lines noted in the original figure of 3,724 -- over _half_ -- involved _meaningless_ (and unnecessary) changes... not only that, but that task _could_ be done in seconds -- literally _seconds_ -- by a good preprocessing tool. (and the programming to do this task is dirt-simple...) think about that for a minute. one single stupid policy accounted for _half_ the changes proofers had to make. that's mind-boggling... and an excellent example of just how dumb the distributed proofreaders workflow really is. *** yes, i'm well aware that a certain number of d.p. projects are not impacted by these _particular_ problems, but i'm also well aware that a certain number -- perhaps half -- have _some_ type of similar problem if you look at them. and usually it doesn't take very much looking to find it... for example, the "planet strappers" book had good scans, and good o.c.r., but then an incompetent content provider went and carelessly changed all em-dashes to en-dashes, forcing the proofers to correct 1,137 unnecessary errors... this is the kind of ineptitude d.p. needs to correct. i found that error within 5 minutes. 
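The rejoining step called "dirt-simple" above really is a few lines of Python. A production version would consult a wordlist so genuine compounds keep their hyphens, and would handle a word broken across three lines; this sketch rejoins unconditionally:

```python
def rejoin_hyphenates(lines):
    """Rejoin end-of-line hyphenates before the text goes to proofers.
    When a line ends in '-', move the first word of the following line
    up to complete the broken word.  Illustrative only: a real tool
    would check a wordlist so compounds like 'clock-work' keep their
    hyphen instead of being joined blindly."""
    lines = list(lines)  # work on a copy; leave the caller's list alone
    for i in range(len(lines) - 1):
        if lines[i].endswith("-") and lines[i + 1].strip():
            first, _, rest = lines[i + 1].lstrip().partition(" ")
            lines[i] = lines[i][:-1] + first  # complete the broken word
            lines[i + 1] = rest               # next line keeps the remainder
    return lines
```

run over the o.c.r. before it reaches p1, this removes both of the meaningless per-hyphenate diffs — the top line and the bottom line — at once.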
and if _i_ had been the person who made that error, i woulda fixed it myself, before it even left my desk to go on to the next volunteer. *** moreover, there are other stupid d.p. policies that caused other changes to be made unnecessarily. for instance... i also standardized _ellipses_ across the different versions. just so you know, there were over 100 ellipses in this book. (who knows how many times they might've been reworked, with people changing 3-dots to 4-dots, then back again...) i also "clothed" the end-line em-dashes automatically, but there were only 28 of 'em, so it doesn't account for much... i also eliminated diacritics, but again, there weren't many... and i deleted garbage characters, like ` and \ and so on... i did a bunch of other little stuff too, but i can't remember; whatever i did was rather primitive... nothing fancy here... nothing outside basic standards of what _should_ be done. still, when you add it all up, you get about 2,000 changes -- that's right, _two_thousand_ -- caused by d.p. policies! so... bad scans and bad o.c.r. accounted for 1,000 bad lines, and d.p. "policies" necessitated another 2,000 changed lines, which means about 3,000 of the original 3,724 changed lines can be chalked up to _avoidable_changes_ on the part of p1... to put it starkly, ~3,000 of the 3,724 lines that proofers had to change -- and let's not forget that the p1 proofers did indeed have to use their time and energy to _change_ all those lines -- shouldn't have required any attention at all. not a single ounce. 3,000 out of 3,724 is _80%_... so 80% of the proofers' energy in finding and fixing those lines was _wasted_, plain and simple -- a figure that is right in line with my data for other books -- and that's just unconscionable. i can't help but think that if the proofers _knew_ this ugly fact, that they would be leaving distributed proofreaders in droves. 
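The other mechanical cleanups listed above — standardizing spaced ellipses, "clothing" end-of-line em-dashes, deleting garbage characters like ` and \ — are each a one-line transform. The exact rules below are my assumptions about the intent, not d.p.'s actual guidelines:

```python
import re

def preprocess_ocr(text):
    """Mechanical cleanup of the kind described above, applied before
    proofing.  Each rule is illustrative; the point is that all of
    them are trivial for a computer and tedious for a human."""
    # collapse spaced ellipsis dots (". . ." -> "..."), keeping the dot count
    text = re.sub(r"\.( ?\.)+", lambda m: "." * m.group(0).count("."), text)
    # "clothe" an end-of-line em-dash by pulling the next line's word up
    text = re.sub(r"--\n(\S+) ?", r"--\1\n", text)
    # delete stray garbage characters the o.c.r. sometimes emits
    text = re.sub(r"[`\\]", "", text)
    return text
```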
how much money would you give to charity if you knew they burned 80% -- literally _burned_ it -- before using the rest? i've known the problem was bad for years. i didn't have a figure to hang on it, but i knew it was bad. now, though, with a number like 80%, maybe you understand why i think this is a moral issue... and i'm gonna keep computing -- and publicizing -- these figures until the "leadership" at distributed proofreaders fixes the problem. -bowerbird ************** Planning your summer road trip? Check out AOL Travel Guides. (http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080407/7d6643ba/attachment.htm From Bowerbird at aol.com Mon Apr 7 13:53:59 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 7 Apr 2008 16:53:59 EDT Subject: [gutvol-d] very interesting great idea. perhaps patentable too. Message-ID: ok, so the other day, jeroen -- you know, one of the guys who keeps coming on here and calling me a "troll" -- posted _this_ over on the d.p. forums: > I was thinking about alternative ways to look at texts, > and came to the following concept: a text heat map that > colors the text based on potential issues. This is partly > based on ideas in word-check, but taken to a further level. ... > For further details: http://www.pgdp.net/wiki/User:Jhellingman/Tools#TextHeatMap among the responses was this: > That looks very interesting indeed! and this: > Wow - what a great idea! and this: > Really neato ideas. Perhaps patentable too. *** of course, people _here_ will recognize that this is an idea that _i_ floated here a couple of weeks back. my post went like this: > i've tested (and enjoyed!) a page display where > every word is _colorized_ independently... 
> the higher the frequency of the word, the lighter it became, so > very common words like "and" and "the" were practically white. > words with just one occurrence in the book were _pure_black_. > low-frequency words which weren't in the dictionary were red. > inconsistent hyphenation, spelling, and so on were turned blue. so i think it's kind of amusing that jeroen is working on it now. and even _more_ amusing that he's so "vague" on its genesis... i mean, it's fine with me. i give voice to my ideas _precisely_so_ other people can run with them if they want. so much the better. and it's not like _coloration_ is some new, unheard-of thingee... my goodness, i've been employing coloration for _decades_ now, even using it frequently in the data presenting i've done recently. so god bless you jeroen, and good luck with "your" work... :+) -bowerbird p.s. besides, when someone else tries to implement my ideas, they almost invariably fail to get the most crucial _details_ right. ************** Planning your summer road trip? Check out AOL Travel Guides. (http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080407/7160a778/attachment.htm From jeroen.mailinglist at bohol.ph Mon Apr 7 14:50:14 2008 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Mon, 07 Apr 2008 23:50:14 +0200 Subject: [gutvol-d] very interesting great idea. perhaps patentable too. In-Reply-To: References: Message-ID: <47FA9716.1020305@bohol.ph> Bowerbird at aol.com wrote: > ok, so the other day, jeroen -- you know, one of the guys who > keeps coming on here and calling me a "troll" -- posted _this_ > over on the d.p. forums: > > [...] > of course, people _here_ will recognize that this is an idea that > _i_ floated here a couple of weeks back. [...] 
> so i think it's kind of amusing that jeroen is working on it now. > and even _more_ amusing that he's so "vague" on its genesis... > > [...] > > so god bless you jeroen, and good luck with "your" work... :+) > > Thanks so much, BB! I have to admit that I read your note, and thought, he's got a point here. However, the difference between an idea and doing something with it, between the creative spark and the implementation is spending some time on it.... Doing is what we need to get things done. > -bowerbird > > p.s. besides, when someone else tries to implement my ideas, > they almost invariably fail to get the most crucial _details_ right. > > Please enlighten me, and I will be more polite in future. Jeroen. From traverso at posso.dm.unipi.it Mon Apr 7 15:03:20 2008 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Tue, 8 Apr 2008 00:03:20 +0200 (CEST) Subject: [gutvol-d] very interesting great idea. perhaps patentable too. In-Reply-To: <47FA9716.1020305@bohol.ph> (jeroen.mailinglist@bohol.ph) References: <47FA9716.1020305@bohol.ph> Message-ID: <20080407220320.D69CD93B62@posso.dm.unipi.it> >>>>> "Jeroen" == Jeroen Hellingman (Mailing List Account) writes: Jeroen> Bowerbird at aol.com wrote: >> ok, so the other day, jeroen -- you know, one of the guys who >> keeps coming on here and calling me a "troll" -- posted _this_ >> over on the d.p. forums: >> >> [...] of course, people _here_ will recognize that this is an >> idea that _i_ floated here a couple of weeks back. [...] so i >> think it's kind of amusing that jeroen is working on it now. >> and even _more_ amusing that he's so "vague" on its genesis... >> >> [...] >> >> so god bless you jeroen, and good luck with "your" work... :+) >> >> Jeroen> Thanks so much, BB! I have to admit that I read your note, Jeroen> and thought, he's got a point here. 
Jeroen> However, the difference between an idea and doing Jeroen> something with it, between the creative spark and the Jeroen> implementation is spending some time on it.... Doing is Jeroen> what we need to get things done. Not to mention that the idea was discussed in the DP forum a couple of YEARS ago, and probably more. And it was also partly implemented by cpeel in his PunctCheck, currently in the test site. Carlo Traverso From Bowerbird at aol.com Mon Apr 7 17:39:29 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 7 Apr 2008 20:39:29 EDT Subject: [gutvol-d] very interesting great idea. perhaps patentable too. Message-ID: jeroen said: > Thanks so much, BB! I have to admit that I read your note, and thought, > he's got a point here. anyway, i wasn't looking for "credit"... really... it truly doesn't matter to me. d.p. could "steal" all of my ideas, today, and i'd be tickled pink... > However, the difference between an idea and doing something with it, > between the creative spark and the implementation is > spending some time on it.... Doing is what we need to get things done. as i made clear in my recent post, i've implemented coloration for years. and as i made clear in my original post, i have even implemented the _frequency-based_ coloration in my recent programs, and _loved_ it. indeed, it was only because the tactic worked _so_ well that i decided i would share it with people. i try a lot of stuff out. some of it works, and a lot of it doesn't. but the stuff that works _really_ well, i notice... in fact, you can pretty much count on the fact that if i _recommend_ something, it's because i have actually _implemented_ it myself and found that it really, actually, truly does work well. until i've done that, i won't stick my neck out... there is no reason to risk my credibility... > Please enlighten me, and I will be more polite in future. well, first of all, i don't really care if you're "more polite in future" or not. 
i've dealt with the name-calling on this listserve for many years now, and i can deal with it for many years to come, if i have to. doesn't bother me... i think it reflects poorly on the name-callers. but it doesn't bother _me_... and as for "enlightening" you, i haven't yet looked closely enough at your web-page in order for me to comment very extensively... however... a quick glance indicates that you don't yet fully appreciate the importance of the _frequency_ variable, since you colorize the high-frequency words... (you give them a different colorizing, yes, but you still flag them with color.) my research indicates a high-frequency word should be treated as correct... because -- in most cases -- it _will_indeed_ be correct, assuming good o.c.r. furthermore, in the rare cases where a high-frequency word _is_ a scanno, that high frequency makes it _likely_ that _one_ occurrence will be detected. and my practice is that _any_ unflagged word which is found to be incorrect triggers a search throughout the entire book looking for that specific word, so _one_ occurrence is all that _needs_ to be detected. so, for instance, to use the "christopher and the clockmakers" book that i've been researching recently, when one occurrence of "mr. button" was corrected to "mr. burton", that would've triggered a search for "button" book-wide, which would have turned up the other 11 occurrences of it... this reflects my overall preference to _underflag_ rather than _overflag_. in a normal proofing pass, i want every _flagged_ word to be _wrong_, with an _extremely_ high probability, something along the lines of .9... simply, i don't want to waste the attention-grabbing power of the flag on a word that _might_ be wrong, even if there's a 50-50 chance it is... now, i also want to give proofers the ability to "turn up the volume". 
so if they _want_ to have any word that is _possibly_ wrong flagged -- with any level of probability the proofer would care to specify -- they can get that. but a _default_flagging_ should mean "must fix". this is another place where i differ -- significantly -- from d.p. policy. to my mind, once a proofer has removed a flag from a word, the flag should _stay_removed_ when the _next_ proofer looks at that word... but d.p. requires a clumsy extra step, where the project manager must "approve" the deflagging of the word, a responsibility that many shirk. as if there weren't enough roadblocks, they had to throw up another... *** but anyway, back to your implementation, jeroen... in addition to the overflagging of words, you overflag punctuation... when you overflag, you undermine the effectiveness of those flags... since much of the punctuation is flagged, including quotemarks, places where the punctuation is likely to be incorrect go unnoticed. for instance, all these lines are the last lines in a paragraph, yet none seem to have paragraph-terminating punctuation: > the Wild Tribes of Mindanao, Moro, and Christian, > and oiled them and said to them, > went to Aponibalagen in Nalpangan and said, > must obey, said to the betel-nut, > it was covered with gold, > Unable to tell where the noise came from, he sat down again, > "or I shall grow on your knee," > must pay as a marriage price for Dapilisan," > you must say that you do not know where I am," > There is no rice, nor beef, nor pork, nor chicken," > "Let us throw him into the water," > if you fail you shall be punished severely," (and search for "asleep at night" to see there are more of 'em.) plus i suspect one of these should be considered "wrong": > A.C. McClurg > A. C. McClurg but both of them look to be flagged the same way... and why should any capital-period string be flagged anyway? initials are quite common. further, why is the word "a" flagged when it starts a sentence? 
and parentheses and braces and brackets shouldn't be flagged unless they happen to be _unbalanced_, and in such situations, the flag should be seriously more strident than what you use... finally, maybe this is just me, but i can't find any way to get the highlighting to _stick_ when i copy the text out of the browser, to paste it into another application, i.e., one where i can _edit_, which seriously hinders an ability to use this capability to edit... and people, i'm giving this feedback because jeroen asked me. i'm not picking on his implementation, or attacking the details. he's done some nice work here, and i congratulate him for that. constructive criticism is a _gift_... *** carlo said: > Not to mention that the idea was discussed in the DP forum > a couple of YEARS ago, and probably more. And it was also partly > implemented by cpeel in his PunctCheck, currently in the test site. oh, carlo. here jeroen is being all nice, and he's the one i poked fun at. but then you come in all poopy, and i wasn't even talking about you... look, i've been using coloration for _decades_, since ventura publisher enabled that capability for people, which was way back in 1987 or so... so it was only natural for me to use it to reflect _frequency_ information. if you just got the idea a couple of years ago, you're way behind, buddy. and as for cpeel and wordcheck, and punctcheck too, he's a latecomer... it's _great_ that he finally came along, because people like _you_, carlo, were happy to let proofers struggle under a "spellchecker" that was truly world-class bad, and cpeel finally rescued d.p. from all _that_ ignominy. but he's still a latecomer to this game. valuable addition. but latecomer. and furthermore, i _tried_ and _tried_ to get you idiots over at d.p. to use a frequency-based flagging methodology in wordcheck, and not dictionary-based flagging, but nobody over there seemed to "get it"... the forum-threads are still there. go read 'em if you need refreshing. 
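The frequency-based flagging argued for above also reduces to very little code. The trust threshold and the "red" label here are illustrative assumptions, not wordcheck's actual behavior:

```python
from collections import Counter
import re

def frequency_flags(book_text, good_words, trust_at=5):
    """Flag words by their frequency *in this book*: a word seen often
    enough is treated as correct even if no wordlist knows it; only
    rare, unknown words get the strident flag."""
    freq = Counter(re.findall(r"[a-z']+", book_text.lower()))
    flags = {}
    for word, n in freq.items():
        if word in good_words or n >= trust_at:
            continue  # common or known: leave unflagged
        flags[word] = "red"  # rare and unknown: probably a scanno, must fix
    return flags

def book_wide_hits(book_text, word):
    """Once one occurrence of a flagged-off word turns out to be wrong
    (the 'mr. button' -> 'mr. burton' case), search the whole book for
    the rest of its occurrences."""
    return [m.start() for m in re.finditer(rf"\b{re.escape(word)}\b",
                                           book_text.lower())]
```

the design choice is underflagging: a default flag means "must fix", and the book-wide search picks up the remaining occurrences once any single one is corrected.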
if i remember right, i even made the post that prompted punctcheck. the fact is, the _interesting_ part of my idea -- the "crucial detail" -- is that the coloration-flagging is based on _frequency_ of the word _in_the_current_book_, not whether it's present in some "dictionary". and you, carlo, were one of the main antagonists to _that_ notion... you were insistent that the project manager should do hard work to make the "good wordlist", rather than letting frequency-data make it. -bowerbird

************** Planning your summer road trip? Check out AOL Travel Guides. (http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080407/ad7a1fe6/attachment-0001.htm

From Bowerbird at aol.com Mon Apr 7 21:17:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Apr 2008 00:17:42 EDT Subject: [gutvol-d] it had to happen sooner or later Message-ID:

scribd.com -- an online document service -- is now offering to scan and o.c.r. and host any documents for free. > http://www.scribd.com/paper they're probably not thinking what you and i are thinking... -bowerbird

From julio.reis at tintazul.com.pt Tue Apr 8 01:57:02 2008 From: julio.reis at tintazul.com.pt (Júlio Reis) Date: Tue, 08 Apr 2008 09:57:02 +0100 Subject: [gutvol-d] bbird's (?) idea In-Reply-To: References: Message-ID: <1207645022.6941.27.camel@abetarda>

> > of course, people _here_ will recognize that this is an idea that > > _i_ floated here a couple of weeks back.
my post went like this: > > > i've tested (and enjoyed!) a page display where > > > every word is _colorized_ independently... > > > the higher the frequency of the word, the lighter it became, so > > > very common words like "and" and "the" were practically white. > > > words with just one occurrence in the book were _pure_black_. > > > low-frequency words which weren't in the dictionary were red. > > > inconsistent hyphenation, spelling, and so on were turned blue. Well... signal to noise, perhaps? Of course you're bound to have some good ideas, everyone does I am sure. Yet you write so much that few people would bother to sift through your posts to find those gems. No offence meant, I just wonder why you *write* so much. (bowerbird pet haters, no replying please.) I also agree with whoever said that we need to put some effort to turn those good ideas into solid prototypes, and not just drop them. Or else someone will lift them and do something useful with them. About patents... oh please. "In Capitalist America, patent owns *you*." Colourising text according to relevance is nice, but hardly the quantum leap I would consider patentable. Sheesh. But yeah, go ahead with the implementation; anything to help us proof better. From hart at pglaf.org Tue Apr 8 12:15:44 2008 From: hart at pglaf.org (Michael Hart) Date: Tue, 8 Apr 2008 12:15:44 -0700 (PDT) Subject: [gutvol-d] Is Jon Noring Still Alive??? Message-ID: I haven't heard anything from or about him for a while. . . . Michael From Bowerbird at aol.com Tue Apr 8 13:03:31 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Apr 2008 16:03:31 EDT Subject: [gutvol-d] Is Jon Noring Still Alive??? Message-ID: michael said: > I haven't heard anything from or about him for a while. . . . as i said the first time you asked, he was alive just two weeks ago... 
> http://www.teleread.org/blog/2008/03/23/our-publish-then-filter-future/ > http://groups.yahoo.com/group/ebook-community/message/28962 > http://groups.yahoo.com/group/ebook-community/message/28991 > http://finance.groups.yahoo.com/group/Self-Publishing/message/79416 > http://finance.groups.yahoo.com/group/Self-Publishing/message/79798 > http://finance.groups.yahoo.com/group/Self-Publishing/message/79828 > http://finance.groups.yahoo.com/group/Self-Publishing/message/80100 -bowerbird ************** Planning your summer road trip? Check out AOL Travel Guides. (http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080408/0bd113f7/attachment.htm From Bowerbird at aol.com Tue Apr 8 13:30:34 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Apr 2008 16:30:34 EDT Subject: [gutvol-d] bbird's (?) idea Message-ID: julio said: > Well... signal to noise, perhaps? it's my detractors outputting noise. left to my own devices, i'm all signal... > Of course you're bound to have some good ideas, > everyone does I am sure. you're too kind... :+) > Yet you write so much that > few people would bother to > sift through your posts to find those gems. i dunno. seems to me lots of people read my posts. including you. :+) > No offence meant, I just wonder why you *write* so much. > (bowerbird pet haters, no replying please.) i have a lot to say... ;+) plus i think good dialog _should_ be happening here in the lobby of the project gutenberg library. so when someone makes a good-faith effort at conversation -- like you have done here, julio -- i take the time to respond with a good-faith effort. to my mind, that indicates a respect for the person. when i no longer respect a person, i stop replying. 
> I also agree with whoever said that > we need to put some effort to > turn those good ideas into solid prototypes, > and not just drop them. as i said just recently... i always have solid prototypes for my ideas before i even bother to make them public... if i don't _know_ -- with certainty -- what i'm talking about, i am reluctant to say anything... that's why it's so hard to catch me in an untruth. i'm not saying it _never_ happens -- happened just recently when juliet pointed out to me that they _do_ do _some_ preprocessing over at d.p., including auto-rejoining of end-line hyphenates -- but it's rare, because i check my facts first... > Or else someone will lift them > and do something useful with them. let me say it again. i _love_ it when someone does "something useful" with one of my ideas. i'm not all "proprietary" about my ideas. they are a gift from the goddess of ideas... it's not like i "own" them or anything. on the contrary, i'm rather fickle in regard to what i think, in the sense that i will change my mind _immediately_ whenever i'm presented with another opinion that does a better job of resonating the tuning fork of truth. the only reason i brought this case up was because i was amused because jeroen attacks me as a "troll" on the one hand, then uses my idea on the other... and since i seem to be repeating myself in this post, i wonder if you read _all_ the posts in an exchange, julio, or if you just jump in without having done that. i can't stop you from doing that, of course, but still... i _will_ say that sometimes when a person does that, they'll make a point that was already addressed in a previous post, which makes them look kind of silly, and also drags down the whole thread in backtracks, making it less useful to the people who _do_ read it all. > About patents... oh please. "In Capitalist America, > patent owns *you*." Colourising text according to > relevance is nice, but hardly the quantum leap > I would consider patentable. Sheesh. 
in addition to reading the thread, i also recommend reading it _carefully_enough_ to avoid misattribution. i didn't bring up "patents". somebody else did, elsewhere. and for all i know, they did it with a humorous take in mind. but i agree with you, fully, that a patent would be ridiculous. (which, by the way, seems to be _exactly_ the type of thing for which the patent office is awarding patents these days.)

> But yeah, go ahead with the implementation;
> anything to help us proof better.

d.p. has never been interested in my software, to the point that i'm now totally uninterested in making software for d.p., to the point i will intentionally cripple anything that i make so that it can be used fully by anyone _except_ d.p. (karma.) but i've already sent my good wishes to jeroen in his efforts. -bowerbird

From Bowerbird at aol.com Tue Apr 8 14:06:43 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Apr 2008 17:06:43 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 08 Message-ID:

ok, here's another easy one... it might not be obvious just how _little_ i'm asking, so let me make that specific. i've talked about the crappy scans that were made: > http://z-m-l.com/go/chris/chris-badpages.html but the only reason those scans are crappy is their band of black at the left-hand side, on recto pages. there's nothing _inherent_ in the book causing that. it was simply a matter of _insufficient_carefulness_ exercised when those pages were being scanned... and even _after_ that scanning, most of the images -- all but 2 -- are fixable with a graphics program, in a matter of mere minutes, by erasing that band...
next comes the o.c.r. it took me under a half-hour to re-do the o.c.r. using abbyy finereader, which is less time (i'm sure) than it took to fix all the scannos which were injected by using the inferior tesseract. and if abbyy woulda been used in the _first_ place... running the o.c.r. results through a clean-up tool takes no more than 15 minutes for a typical book. *** all in all, one hour of time by the content provider -- or, really, _anybody_ prior to the proofings -- could've repaired the bad actions that he'd made, and saved the p1 proofers thousands of changes... and -- without the bad actions in the first place -- all those changes could have been saved by a mere 15 minutes of work on his part with a clean-up tool. -bowerbird ************** Planning your summer road trip? Check out AOL Travel Guides. (http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080408/a0fcd7c9/attachment.htm From bzg at altern.org Tue Apr 8 14:17:38 2008 From: bzg at altern.org (Bastien) Date: Tue, 08 Apr 2008 23:17:38 +0200 Subject: [gutvol-d] bbird's (?) idea In-Reply-To: (Bowerbird@aol.com's message of "Tue, 8 Apr 2008 16:30:34 EDT") References: Message-ID: <87d4p09hcd.fsf@bzg.ath.cx> Bowerbird at aol.com writes: > i dunno. seems to me lots of people read my posts. Don't rely too much on your own wild guesses here. > plus i think good dialog _should_ be happening > here in the lobby of the project gutenberg library. Good dialog does not happen magically. Some conditions need to be met, namely a friendly atmosphere and the general feeling that people are willing to contribute by reading and taking your writings into consideration, as much as they are willing to contribute with their own inputs. Sorry but right now your monolog is just a nuisance. It looks like you don't care about those conditions. 
For example you don't care that many readers are not english native speakers, so they might be fed up by the length of your posts. The web is there for you, get a blog. -- Bastien

From Bowerbird at aol.com Tue Apr 8 14:58:56 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Apr 2008 17:58:56 EDT Subject: [gutvol-d] bbird's (?) idea Message-ID:

bastien said: > Don't rely too much on your own wild guesses here. you seem to have read my post. no "wild guess" there... > Good dialog does not happen magically. "magically"? well... very few things happen "magically". almost none, i'd say. and even that might be optimistic. but good conversation isn't all that rare. maybe _here_ it is, but i have good dialog with my friends all the time. it's not even that hard. they talk. i talk back. they talk back. it's usually quite easy... sometimes it's absolutely effortless... > Some conditions need to be met, > namely a friendly atmosphere i seem to be able to contribute even in a hostile atmosphere. but maybe that's just me... > and the general feeling that people are willing to > contribute by reading and taking your writings > into consideration, as much as they are willing to > contribute with their own inputs. well, i don't know about you, or anyone else, but i'm certainly willing to read people's posts. there are some who -- after _many_ years of reading (and responding to) their posts -- i have given up on, because after all those years of reading their messages, i became thoroughly convinced they had nothing to add, or at least nothing new to add, so i stopped reading them. but i certainly showed a willingness to consider their posts. > Sorry but right now your monolog is just a nuisance. this must be that "friendly atmosphere" you talked about above. > It looks like you don't care about those conditions. that's quite a job of projecting you're doing there, bastien. it must've been a very good rorschach blot i put out there...
> For example you don't care that > many readers are not english native speakers, > so they might be fed up by the length of your posts. my posts are as long (or as short) as they need to be in order to say what i want them to say, though i often wish the topics i talk about weren't quite so complex, so i could deal with them in a more cursory fashion, without sacrificing the depth of analysis i want to do... but i fully understand if anyone -- especially those subscribers who are not english-native speakers -- don't have enough time to read them in their entirety. after all, it's a busy world; people are pressed for time. of course, i also wish people here in the lobby of the project gutenberg library would stay _on-topic_ and talk about electronic-books and their digitization, instead of constantly going off-topic to talk about personal issues, but i guess that ain't gonna happen, because people sure do like to discuss the bowerbird, don't they? i mean, heck, there are significant _moral_ issues on the table, but people seem to have _nothing_ to say about them, they're too busy talking 'bout bird... > The web is there for you, get a blog. gee, bastien, thanks for the advice... -bowerbird ************** Planning your summer road trip? Check out AOL Travel Guides. (http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080408/02268774/attachment-0001.htm From bzg at altern.org Tue Apr 8 15:03:09 2008 From: bzg at altern.org (Bastien) Date: Wed, 09 Apr 2008 00:03:09 +0200 Subject: [gutvol-d] bbird's (?) idea In-Reply-To: (Bowerbird@aol.com's message of "Tue, 8 Apr 2008 17:58:56 EDT") References: Message-ID: <87r6dg6m3m.fsf@bzg.ath.cx> Bowerbird at aol.com writes: > but good conversation isn't all that rare. maybe _here_ > it is, but i have good dialog with my friends all the time. 
As an exercise: try to wait until four different people post on this list before you reply to one of them. You'll see. Some conversation may magically appear then. (Of course I wanted to see three, not four, but I knew you would count this post as one.) -- Bastien

From Bowerbird at aol.com Tue Apr 8 15:23:09 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Apr 2008 18:23:09 EDT Subject: [gutvol-d] bbird's (?) idea Message-ID:

bastien said: > As an exercise: try to wait until four different people post > on this list before you reply to one of them. You'll see. > Some conversation may magically appear then. i see. you want to have a conversation _without_ me. :+) ok, go ahead. you start a thread, and i won't reply until: (a) a half-dozen people respond to it, or (b) a dozen posts are made. -bowerbird

From hart at pglaf.org Wed Apr 9 11:36:30 2008 From: hart at pglaf.org (Michael Hart) Date: Wed, 9 Apr 2008 11:36:30 -0700 (PDT) Subject: [gutvol-d] Is Jon Noring Still Alive??? In-Reply-To: References: Message-ID:

On Tue, 8 Apr 2008, Bowerbird at aol.com wrote: > michael said: >> I haven't heard anything from or about him for a while. . . . > > as i said the first time you asked, he was alive just two weeks ago... sorry, I didn't seem to get that message. . . . and I also emailed him directly with no response.
> >> http://www.teleread.org/blog/2008/03/23/our-publish-then-filter-future/ > >> http://groups.yahoo.com/group/ebook-community/message/28962 >> http://groups.yahoo.com/group/ebook-community/message/28991 > >> http://finance.groups.yahoo.com/group/Self-Publishing/message/79416 >> http://finance.groups.yahoo.com/group/Self-Publishing/message/79798 >> http://finance.groups.yahoo.com/group/Self-Publishing/message/79828 >> http://finance.groups.yahoo.com/group/Self-Publishing/message/80100 > > -bowerbird > > > > ************** > Planning your summer road trip? Check out AOL Travel Guides. > > (http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) > > From jeroen.mailinglist at bohol.ph Wed Apr 9 12:00:22 2008 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Wed, 09 Apr 2008 21:00:22 +0200 Subject: [gutvol-d] very interesting great idea. perhaps patentable too. In-Reply-To: References: Message-ID: <47FD1246.5050603@bohol.ph> Bowerbird at aol.com wrote: >> Please enlighten me, and I will be more polite in future. >> > > and as for "enlightening" you, i haven't yet looked closely enough at your > web-page in order for me to comment very extensively... > > > my research indicates a high-frequency word should be treated as correct... > > My own experience corresponds with that. Singletons get colored red. Twins orange, and from then it quickly goes to lighter shades of yellow, and things appearing often are just as white as words in the dictionary. > so, for instance, to use the "christopher and the clockmakers" book that > i've been researching recently, when one occurrence of "mr. button" was > corrected to "mr. burton", that would've triggered a search for "button" > book-wide, which would have turned up the other 11 occurrences of it... > > That is a valuable suggestion. Part of the idea is already in PG, if you use the suggestions for bad word interface. It tells you which words have been changed. 
The scanno coloring will do something similar, but is definitely more work, as I need statistics on words in context for that. > but d.p. requires a clumsy extra step, where the project manager must > "approve" the deflagging of the word, a responsibility that many shirk. > as if there weren't enough roadblocks, they had to throw up another... > > I agree here. I typically spend time accepting all flagged words, except those obviously wrong. > *** > > but anyway, back to your implementation, jeroen... > > in addition to the overflagging of words, you overflag punctuation... > > Agreed. I suffer from lack of a "punctuation dictionary" so everything is based on frequency. > when you overflag, you undermine the effectiveness of those flags... > > Agreed. > >> the Wild Tribes of Mindanao, Moro, and Christian, >> and oiled them and said to them, >> went to Aponibalagen in Nalpangan and said, >> must obey, said to the betel-nut, >> it was covered with gold, >> Unable to tell where the noise came from, he sat down again, >> "or I shall grow on your knee," >> must pay as a marriage price for Dapilisan," >> you must say that you do not know where I am," >> There is no rice, nor beef, nor pork, nor chicken," >> "Let us throw him into the water," >> if you fail you shall be punished severely," >> > > Here a funny thing happens. I noticed these, went back to the scans, and saw this actually in the source. I now mark these cases with a tag in the (master SGML) source. > (and search for "asleep at night" to see there are more of 'em.) > > plus i suspect one of these should be considered "wrong": > >> A.C. McClurg >> A. C. McClurg >> That again is a pain. I haven't been able to make a consistent choice. Yet. For some time I used half spaces as a compromise. I need to get consistent in these. > but both of them look to be flagged the same way... > > further, why is the word "a" flagged when it starts a sentence? > > Dictionary issue: it has no one-letter words. 
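[the paragraph-ending check behind the list quoted above -- last lines ending in a bare comma, or in a comma just before a closing quote -- could be sketched like this. the terminal-punctuation set is a guess for the sketch, not anyone's actual rule:]

```python
def suspicious_paragraph_ends(paragraphs):
    """Return paragraphs whose last character, looking past any
    closing quotation marks, is not plausible paragraph-terminating
    punctuation (a trailing comma being the classic case)."""
    terminal = ".!?:;"
    closers = "\"'\u201d\u2019"  # straight and curly closing quotes
    bad = []
    for p in paragraphs:
        core = p.rstrip().rstrip(closers)  # look past closing quotes
        if core and core[-1] not in terminal:
            bad.append(p)
    return bad

paras = [
    "it was covered with gold,",
    'must pay as a marriage price for Dapilisan,"',
    "Unable to tell where the noise came from, he sat down again.",
]
print(suspicious_paragraph_ends(paras))  # flags the first two only
```

as the thread notes, some of these turn out to be faithful to the printed source, so a check like this can only raise a flag for a human to resolve against the scan.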
> and parentheses and braces and brackets shouldn't be flagged > unless they happen to be _unbalanced_, and in such situations, > the flag should be seriously more strident than what you use... > > Agreed. This is something I am looking into, as with quotation marks. > finally, maybe this is just me, but i can't find any way to get the > highlighting to _stick_ when i copy the text out of the browser, > to paste it into another application, i.e., one where i can _edit_, > which seriously hinders an ability to use this capability to edit... > > I would consider that an issue of the browser or editor you are using. I am using CSS classes on spans to apply the colors. Another CSS can give you different colors at ease. > and people, i'm giving this feedback because jeroen asked me. > i'm not picking on his implementation, or attacking the details. > he's done some nice work here, and i congratulate him for that. > constructive criticism is a _gift_... > > And I am happy with your remarks. Thanks BB. Jeroen From grythumn at gmail.com Wed Apr 9 12:43:11 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Wed, 9 Apr 2008 15:43:11 -0400 Subject: [gutvol-d] very interesting great idea. perhaps patentable too. In-Reply-To: <47FD1246.5050603@bohol.ph> References: <47FD1246.5050603@bohol.ph> Message-ID: <15cfa2a50804091243p398f1cd3we5e9ebb770aa75aa@mail.gmail.com> On Wed, Apr 9, 2008 at 3:00 PM, Jeroen Hellingman (Mailing List Account) < jeroen.mailinglist at bohol.ph> wrote: > > > > my research indicates a high-frequency word should be treated as > correct... > > > My own experience corresponds with that. Singletons get colored red. > Twins orange, and from then it quickly goes to lighter shades of yellow, > and things > appearing often are just as white as words in the dictionary. > This doesn't help much with stealth scannos. 
For example: "adventure of the speckled *band*" vs "adventure of the speckled *hand*" Or perhaps the more classic example: "It is impossible to say when an Asiatic stream began to pour into Europe over the *arid* steppes north of the Caspian." "It is impossible to say when an Asiatic stream began to pour into Europe over the *and* steppes north of the Caspian." On the first example, both band and hand are moderately uncommon words, and quite difficult to determine which is correct solely from the context provided. On the second example, "arid", while correct, would stand out like a sore thumb, while "and", incorrect, would disappear. IOW, while word frequency checks are a useful tool with a human in the loop, they are not suitable for automatic changes without a significant advance in natural language processing techniques. There are many books with low frequency, not in spellcheck, words that are indeed correct. Some classes of text with these characteristics include: Books with dialects. Books with many proper names. Books with lots of loanwords and placenames. Books with a lot of technical language and math. R C -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080409/299f6dd9/attachment.htm From jeroen.mailinglist at bohol.ph Wed Apr 9 13:38:06 2008 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Wed, 09 Apr 2008 22:38:06 +0200 Subject: [gutvol-d] very interesting great idea. perhaps patentable too. In-Reply-To: <15cfa2a50804091243p398f1cd3we5e9ebb770aa75aa@mail.gmail.com> References: <47FD1246.5050603@bohol.ph> <15cfa2a50804091243p398f1cd3we5e9ebb770aa75aa@mail.gmail.com> Message-ID: <47FD292E.9040306@bohol.ph> Robert Cicconetti wrote: > This doesn't help much with stealth scannos. 
For example: > "adventure of the speckled *band*" vs > "adventure of the speckled *hand*" > > Or perhaps the more classic example: > "It is impossible to say when an Asiatic stream began to pour into Europe > over the *arid* steppes north of the Caspian." > "It is impossible to say when an Asiatic stream began to pour into Europe > over the *and* steppes north of the Caspian." > > [...] while word > frequency checks are a useful tool with a human in the loop, they are not > suitable for automatic changes without a significant advance in natural > language processing techniques. > > I think, by just collecting statistics on word-pairs that involve scannos, I can very quickly color the second example. It is very clear. No need for difficult analysis here. (You will be surprised how much you can do with just word pairs and numbers.) The first example can only be verified against the source, even a carefully reading human cannot decide here. I am writing scripts to collect data from a 100 million word corpus (downloaded from www.dbnl.nl; mostly old dutch books; a similar exercise needs to be done for English. I will use the PG collection for that.) Maybe people can suggest other sources of bulk texts. (Wikipedia, Usenet, archives of newspapers on-line.) I tried to obtain a large corpus that has been better tagged, but these things are difficult to get for free (Universities sit on them like they are gold mines, especially the tagged, enriched ones, as I think scanno detection can be improved if you do some little grammar analysis). Of course, I am also looking at ways I can use the largest corpus of them all: Google. Some observations: those 100 million words include about 2 million unique words. Of those, one million are singletons. Plenty of these are (on first observation) spelling mistakes, OCR errors, or plain odd spellings (the corpus I use contains a lot of irregular stuff, such as medieval plays and old Dutch dictionaries). 
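[Jeroen's word-pair idea -- "you will be surprised how much you can do with just word pairs and numbers" -- might look roughly like the sketch below; the corpus format, threshold, and names are all illustrative. a pair like "the and" essentially never occurs in clean text, so it stands out even though both words are individually very common:]

```python
from collections import Counter
import re

def adjacent_pairs(text):
    """Split text into lowercase words and return adjacent pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))

def build_pair_counts(corpus_texts):
    """Count adjacent word pairs over a reference corpus."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(adjacent_pairs(text))
    return counts

def flag_unlikely_pairs(text, pair_counts, min_count=2):
    """Flag pairs in `text` seen fewer than `min_count` times in
    the corpus -- candidate stealth scannos like 'the and'."""
    return [p for p in adjacent_pairs(text) if pair_counts[p] < min_count]

corpus = ["over the arid steppes", "over the arid plain", "across the arid north"]
counts = build_pair_counts(corpus)
print(flag_unlikely_pairs("over the and steppes", counts))
```

in practice the counts would of course come from something on the scale of the 100-million-word corpus described above, with a document-count threshold applied as well so that one odd book cannot legitimize a pair.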
A huge number of accent variations, and a lot of injections of German words (my tools try to guess the language of each fragment before processing it. I drop things in anything but Dutch, but German is quite similar), and less so in other languages. The last couple of thousands of words are actually Greek (since Greek sorts after the Latin alphabet.) I've matched the words with a large modern spelling dictionary (about 3 gigs of words), using fuzzy rules that reflect historical orthography changes. I will now develop scripts to collect word-pairs, but only if both words occur more than 10 times and in at least 5 documents. This pair information, I will use (among other things) to create some scanno detection software. I will supplement this with word classification data, to improve detection when I do not have sufficient data. (To catch in the wild all words listed in a major dictionary you need more than a 100 million words) > There are many books with low frequency, not in spellcheck, words that are > indeed correct. Some classes of text with these characteristics include: > Books with dialects. Books with many proper names. Books with lots of > loanwords and placenames. Books with a lot of technical language and math. > I typically like to have those kind of books on-line, so I am confronted with such errors regularly. All those anthropological works with stories in languages most people never heard of... Dictionaries, Gazetteers, Volume after volume of animals. That is why I need more powerful tooling. Jeroen. From Bowerbird at aol.com Wed Apr 9 14:00:02 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Apr 2008 17:00:02 EDT Subject: [gutvol-d] very interesting great idea. perhaps patentable too. Message-ID: jeroen said: > Singletons get colored red. Twins orange, and from then > it quickly goes to lighter shades of yellow, everything that is colorized seems flagged to me. might want to check if other people feel the same. 
> and things appearing often are > just as white as words in the dictionary. that wasn't true of the page i looked at. all the names -- which were high-frequency occurrences -- were colorized. the overflagging of names was the sore point in the _old_ spellchecker, and one of the huge advances of wordcheck. > Here a funny thing happens. I noticed these, went back > to the scans, and saw this actually in the source. i'd consider that to be an outdated typographic convention, and make corrections. *** robert said: > This doesn't help much with stealth scannos. For example: it's not intended to. stealth scannos need their own detection methods... > On the second example, "arid", while correct, would stand out > like a sore thumb, while "and", incorrect, would disappear. again, this isn't intended to catch _stealth_ scannos. but, just to steer you right, "arid" would not "stand out", let alone "like a sore thumb", because it's in the dictionary, thus unflagged. does this mean that all the "arid" cases that _should_ be "and" go unflagged at this step? why, yes, as a matter of fact, it does. because that's precisely what we _want_ to happen, at this step. a later step looks for stealth scannos, and we'll catch them then. > while word frequency checks are a useful tool with a human > in the loop, they are not suitable for automatic changes without > a significant advance in natural language processing techniques. why is there this stupid sort of assumption that automatic changes will never ever be reviewed by a human? where did that come from? there's this weird idea out there that we just turn the clean-up tool loose on the text, and it makes changes totally outside our control, running amok like a tiny king kong amid the new york of our text... -bowerbird p.s. oh, and with your second example, a stealth scanno detection strategy is to flag atypical dyads and triads, such as "over the and"... ************** Planning your summer road trip? Check out AOL Travel Guides. 
(http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080409/fc052fe2/attachment.htm From Bowerbird at aol.com Wed Apr 9 14:23:55 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Apr 2008 17:23:55 EDT Subject: [gutvol-d] very interesting great idea. perhaps patentable too. Message-ID: jeroen said: > I am writing scripts to collect data from a 100 million word corpus > (downloaded from www.dbnl.nl; mostly old dutch books; > a similar exercise needs to be done for English. > I will use the PG collection for that. that's one way of going about the task. another way is to simply do it on-the-fly, using a couple dozen existing e-texts... (or more, if your program is fast enough.) i think you'll find the second is nearly as good as the first, and a whole lot simpler to manage. flexibility is a good asset to have with your feet. huge databases weigh you down like an anchor. > Of course, I am also looking at ways I can use > the largest corpus of them all: Google. well, if you really want that, google has made it available. their corpus had a trillion words, if i remember correctly, and they split it up into dyads, triads, and what have you. again, i believe this would be overkill. but it _is_ for sale. *** robert said: > There are many books with low frequency, not in spellcheck, > words that are indeed correct. of course. > Some classes of text with these characteristics include: > Books with dialects. Books with many proper names. > Books with lots of loanwords and placenames. > Books with a lot of technical language and math. ok, let's look at those individually... > Books with dialects. i can see we probably disagree on what constitutes "low-frequency". i think anytime you get the same string out of o.c.r. 4 times or more, you no longer have a word that is "low-frequency"... 
so, for dialect, many of the terms are going to occur more than 4 times in the book. so you don't need to look at those words. you'll look at the ones that don't... > Books with many proper names. there are lots of dictionaries of names. names are no problem. > Books with lots of loanwords and placenames. loanwords? as for placenames, again, lots of dictionaries of those. and besides, if you actually _look_ at the p.g. e-texts, you'll see many of the same names pop up repeatedly. and, to be more specific, if you check out gutenmark, you'll find ron burkey collected the p.g. placenames... > Books with a lot of technical language and math. well, i'll let the math people worry about the math books... those people have a lot (too much) on their plate already... *** you can manufacture things that one _might_ have to "worry" about when one uses a clean-up tool like this, but once you actually program one and use it, you find it goes pretty smoothly. -bowerbird From jeroen.mailinglist at bohol.ph Wed Apr 9 14:46:48 2008 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Wed, 09 Apr 2008 23:46:48 +0200 Subject: [gutvol-d] very interesting great idea. perhaps patentable too. In-Reply-To: References: Message-ID: <47FD3948.8070407@bohol.ph> Bowerbird at aol.com wrote: > > everything that is colorized seems flagged to me. > > To me, things in red jump far more into my eye than things with a light yellow background. I still need to tweak the LUT to make this work optimally, as, as you say, the overflagging is a big issue. It is a drain on the eye, and will hide the real problems in the text.
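bowerbird's four-occurrence heuristic is easy to sketch in Python. The threshold, the crude tokenizer, and the toy dictionary below are illustrative assumptions on my part, not any actual DP or wordcheck code:

```python
import re
from collections import Counter

def flag_low_frequency(text, dictionary, threshold=4):
    """Flag words worth a human look: anything that occurs fewer than
    `threshold` times in the book AND is not in the dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return sorted(w for w, n in counts.items()
                  if n < threshold and w not in dictionary)

# dialect terms repeated 4+ times pass unflagged; one-off o.c.r. junk is caught
sample = "tha knows tha knows tha knows tha knows but thc once"
print(flag_low_frequency(sample, dictionary={"but", "once"}))  # ['thc']
```

Here "tha" and "knows" each occur four times, so under the heuristic they are treated as established dialect spellings, while the one-off "thc" is flagged for review.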
I will probably try something like Occurrence: once: red; twice: orange; thrice: yellow; 4 to 8 times: light yellow; 9-25 times: even lighter yellow. (I like to have an idea of what does not appear in the dictionary.) I now have a bit cruder scale. Jeroen. From jeroen.mailinglist at bohol.ph Wed Apr 9 15:02:22 2008 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Thu, 10 Apr 2008 00:02:22 +0200 Subject: [gutvol-d] very interesting great idea. perhaps patentable too. In-Reply-To: References: Message-ID: <47FD3CEE.6010708@bohol.ph> Bowerbird at aol.com wrote: > flexibility is a good asset to have with your feet. > huge databases weigh you down like an anchor. > Yup, but I have about a terabyte of storage at home, so its weight is relative. Processing such bulk becomes the bottleneck. > well, if you really want that, google has made it available. > their corpus had a trillion words, if i remember correctly, > and they split it up into dyads, triads, and what have you. > > again, i believe this would be overkill. but it _is_ for sale. > > Great, but I have little money to spend on data, and you are right, it is overkill. I think collecting dyads from about a 100 million words, and limiting yourself to the 10,000 most common words already gives you plenty of data. >> Books with dialects. >> > > i can see we probably disagree on what constitutes "low-frequency". > > i think anytime you get the same string out of o.c.r. 4 times or more, > you no longer have a word that is "low-frequency"... so, for dialect, > many of the terms are going to occur more than 4 times in the book. > so you don't need to look at those words. you'll look at the ones that don't... > > I do works in dialects and minority languages, and face several issues: - no official standard orthography (so people write as they like, with lots of variation) - limited knowledge of the language or dialect (or the rules of spelling applied) - too small a sample to build a reasonable corpus.
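Jeroen's occurrence-to-colour lookup table, as sketched above, might look like this in Python. The colour names and the behaviour above 25 occurrences are my guesses; only the bucket edges come from his description:

```python
def highlight_colour(count, in_dictionary):
    """Map a word's occurrence count to a highlight colour, per the
    scale sketched above; dictionary words are never flagged."""
    if in_dictionary or count > 25:
        return "white"          # frequent or known: trusted, unflagged
    if count == 1:
        return "red"
    if count == 2:
        return "orange"
    if count == 3:
        return "yellow"
    if count <= 8:
        return "lightyellow"    # 4 to 8 occurrences
    return "lighteryellow"      # 9 to 25 occurrences
```

The point of the graded scale is exactly the overflagging problem discussed above: a hapax gets a loud colour, a word seen a dozen times gets a whisper, and the eye is drained less.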
You probably need about half a billion words of text, from all fields, to collect a complete set of words included in a spelling dictionary. >> Books with many proper names. >> > > there are lots of dictionaries of names. names are no problem. > > Yep, try the ones available at the US Census. I used them to generate fake patient names for a Hospital Information System simulator. I then fed those randomly generated names into Google to get an associated birth date (which I got from criminal records; with 1 in 100 Americans in prison, every other randomly generated name is a hit: try it, then think about what it means). > >> Books with lots of loanwords and placenames. >> > > I use the databases collected by the US military, but they are considerable overkill, and weigh in at about 200 M. Every tiny hamlet in every country is listed, often complete with old names. I have ideas to do something nice with those two. (Putting books on the map like Google Books does, but better.) Jeroen. From Bowerbird at aol.com Wed Apr 9 16:29:00 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Apr 2008 19:29:00 EDT Subject: [gutvol-d] bbird's (?) idea Message-ID: i said: > you start a thread, and i won't reply until: > (a) a half-dozen people respond to it, or > (b) a dozen posts are made. take your time, bastien... no pressure... whenever you're ready... -bowerbird From grythumn at gmail.com Wed Apr 9 21:13:41 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Thu, 10 Apr 2008 00:13:41 -0400 Subject: [gutvol-d] very interesting great idea. perhaps patentable too.
In-Reply-To: <47FD292E.9040306@bohol.ph> References: <47FD1246.5050603@bohol.ph> <15cfa2a50804091243p398f1cd3we5e9ebb770aa75aa@mail.gmail.com> <47FD292E.9040306@bohol.ph> Message-ID: <15cfa2a50804092113t315a34a0xffbc6c00430a8ea6@mail.gmail.com> On Wed, Apr 9, 2008 at 4:38 PM, Jeroen Hellingman (Mailing List Account) < jeroen.mailinglist at bohol.ph> wrote: > Robert Cicconetti wrote: > > This doesn't help much with stealth scannos. For example: > > "adventure of the speckled *band*" vs > > "adventure of the speckled *hand*" > > > > Or perhaps the more classic example: > > "It is impossible to say when an Asiatic stream began to pour into > Europe > > over the *arid* steppes north of the Caspian." > > "It is impossible to say when an Asiatic stream began to pour into > Europe > > over the *and* steppes north of the Caspian." > > > > [...] while word > > frequency checks are a useful tool with a human in the loop, they are > not > > suitable for automatic changes without a significant advance in natural > > language processing techniques. > > > > > I think, by just collecting statistics on word-pairs that involve > scannos, I can very quickly color the second example. It is > very clear. No need for difficult analysis here. (You will be surprised > how much you can do with just word pairs and numbers.) > Arid / and is fairly simple if you have a good sentence parser AND the text contains regular sentence structure, or, as you suggested, it is near another word commonly associated with it (say, desert, landscape, tundra, et al). But you can't, as in the original passage I was objecting to, assume high-frequency words are correct. They are more likely to be, yes, but you can't assume it for the general case. The first example can only be verified against the source, even a > carefully reading human cannot decide here. > Exactly. I chose the first as it is the same part of speech, and either case can make sense in the limited context given.
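Jeroen's word-pair idea (compare the printed bigram against the bigram you get by substituting the known scanno partner) can be sketched in a few lines. The bigram counts and the scanno table below are made-up illustrations, not real corpus data:

```python
from collections import Counter

# bigram counts harvested from a clean reference corpus -- illustrative numbers only
BIGRAMS = Counter({("arid", "steppes"): 12, ("speckled", "band"): 7,
                   ("speckled", "hand"): 1})

# a tiny, hypothetical table of known o.c.r. confusion pairs
SCANNO_PARTNERS = {"and": "arid", "arid": "and", "band": "hand", "hand": "band"}

def suspicious(word, next_word):
    """Flag `word` when its scanno partner is far more common before
    `next_word` in the reference corpus than the word actually printed."""
    partner = SCANNO_PARTNERS.get(word)
    if partner is None:
        return False
    seen = BIGRAMS[(word, next_word)]     # Counter returns 0 for unseen pairs
    alt = BIGRAMS[(partner, next_word)]
    return alt > 3 * max(seen, 1)         # crude ratio test

print(suspicious("and", "steppes"))   # True  -- "arid steppes" dominates
print(suspicious("arid", "steppes"))  # False
```

As both sides of the thread note, this catches "and steppes" easily but says nothing about "band" vs "hand", where both bigrams are plausible and only the page image can decide.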
R C -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080410/d3c00b29/attachment.htm From walter.van.holst at xs4all.nl Thu Apr 10 00:38:54 2008 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Thu, 10 Apr 2008 09:38:54 +0200 Subject: [gutvol-d] very interesting great idea. perhaps patentable too. In-Reply-To: <47FD3CEE.6010708@bohol.ph> References: <47FD3CEE.6010708@bohol.ph> Message-ID: <47FDC40E.7060402@xs4all.nl> Jeroen Hellingman (Mailing List Account) wrote: >> well, if you really want that, google has made it available. >> their corpus had a trillion words, if i remember correctly, >> and they split it up into dyads, triads, and what have you. >> >> again, i believe this would be overkill. but it _is_ for sale. >> >> > Great, but I have little money to spend on data, and you are right, it > is overkill. I think collecting dyads from about a 100 million words, > and limiting yourself to the 10,000 most common words already > gives you plenty of data. The Google n-gram corpus is only 160 USD, which in terms of real money isn't that much anymore for those who can't have enough overkill. Its license limits it to research use, but the wording suggests that Gutenberg would fit in that category. Regards, Walter From bzg at altern.org Thu Apr 10 01:12:52 2008 From: bzg at altern.org (Bastien) Date: Thu, 10 Apr 2008 10:12:52 +0200 Subject: [gutvol-d] bbird's (?) idea In-Reply-To: (Bowerbird@aol.com's message of "Wed, 9 Apr 2008 19:29:00 EDT") References: Message-ID: <87wsn6xh4r.fsf@bzg.ath.cx> Bowerbird at aol.com writes: > i said: >> you start a thread, and i won't reply until: >> (a) a half-dozen people respond to it, or >> (b) a dozen posts are made. > > take your time, bastien... no pressure... whenever you're ready... :) I didn't propose to start a thread myself (you proposed this.)
I proposed that you wait for *someone* to start a thread and for three different people to answer his thread before you jump on it. Or, to make the game more flexible, four people starting four threads would also be okay. -- Bastien From Bowerbird at aol.com Thu Apr 10 01:34:07 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Apr 2008 04:34:07 EDT Subject: [gutvol-d] bbird's (?) idea Message-ID: bastien said: > I proposed that you wait for *someone* to start a thread yeah, but since you're the one who voiced the complaint, isn't it only fair to expect that you would start the thread? after all, _you_ are "someone"... aren't you? and presumably you are one of the "someones" who feels my "jumping" on a thread intimidates them from posting... so hop to it! start a thread! you got your wish, so use it! :+) (besides, i don't buy your argument from the get-go, since so many people have already kill-filed me -- haven't they? -- so they don't even _know_ whether i've replied to any post.) > I proposed that you wait for *someone* to start a thread > and for three different people to answer his thread > before you jump on it. Or, to make the game more flexible, > four people starting four threads would also be okay. maybe you haven't noticed, but i start most of the threads here. most especially the ones that end up having legs. just me trying to make the lobby of the project gutenberg library a lively place to be... if i sat around waiting for someone else to start a thread, and then for another 3 or 4 people to respond to it, i'd do little but sit... but hey, since you still seem to want to play this little game, anyone who puts "bastien rules!" at the start of their subject when they start a _brand-new_thread_ will -- if i happen to feel like playing along at the time -- receive a "pass" from me on my replies, until 3 people have responded, or a half-dozen posts have been made, or 36 hours have passed. how's that? now everyone can play!
-bowerbird From Bowerbird at aol.com Thu Apr 10 01:41:56 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Apr 2008 04:41:56 EDT Subject: [gutvol-d] very interesting great idea. perhaps patentable too. Message-ID: robert said: > But you can't, as in the original passage I was objecting to, > assume high-frequency words are correct. They are more likely to be, > yes, but you can't assume it for the general case. gosh, this is so idiotic, i don't know where to start. let's try this: of course you can assume high-frequency words are correct. you just do it. hopefully, you're not an _idiot_ about it, and you fully recognize that _sometimes_ your _assumption_ is going to prove to be _incorrect_... so your next step is to try and identify those instances. and usually you find a way to get _some_ of them, so you put that trick in your bag, and then try to identify the rest. and you keep whittling away at the problem until it's solved. people who throw up their hands and say "it's impossible" before they have even _tried_ to do something are irritating. you weigh the costs and benefits of a course of action and you try to jack up the benefits and lessen the costs. you do research, and use it to refine your methodologies. nitpicking every course of action at the outset causes paralysis. > Exactly. I chose the first as it is the same part of speech, > and either case can make sense in the limited context given. fine. you made up an example. bully for you. so what? now go to some real, actual, live, data, like, from a book, and show us exactly how often cases like that come up, as opposed to other instances which are not ambiguous.
what you will find is that the unambiguous cases are _far_ more prevalent, so this strategy is a good one to follow... and if you had the balls to _try_ it, you'd have learned that. -bowerbird From bzg at altern.org Thu Apr 10 01:57:29 2008 From: bzg at altern.org (Bastien) Date: Thu, 10 Apr 2008 10:57:29 +0200 Subject: [gutvol-d] bbird's (?) idea In-Reply-To: (Bowerbird@aol.com's message of "Thu, 10 Apr 2008 04:34:07 EDT") References: Message-ID: <87skxukrye.fsf@bzg.ath.cx> Bowerbird at aol.com writes: > (besides, i don't buy your argument from the get-go, since > so many people have already kill-filed me -- haven't they? -- > so they don't even _know_ whether i've replied to any post.) If you assume most people kill-filed you, it won't help moderate your posting rate, and it won't help you get read. > maybe you haven't noticed, but i start most of the threads here. Meaning that there is a problem. Statistically, it is not possible that only you have new ideas that are worth sharing with others. > if i sat around waiting for someone else to > start a thread, and then for another 3 or 4 > people to respond to it, i'd do little but sit... Or try to cook your pudding? You know, the code behind your ideas. > but hey, since you still seem to want to play this little game, > anyone who puts "bastien rules!" at the start of their subject > when they start a _brand-new_thread_ will -- if i happen to > feel like playing along at the time -- receive a "pass" from me > on my replies, until 3 people have responded, or a half-dozen > posts have been made, or 36 hours have passed. how's that? That's silly.
It's not about "Bastien's rule", it's about trying to make room for people to feel like it's worth sharing ideas here. The only time I wanted to share something here (some Emacs lisp code implementing a heuristic to help users correctly format a PG text file), I didn't take the time to speak about this because I thought that you would flood this under your rants. I won't go further on this. I'll open a new thread when I have something to say. -- Bastien From hyphen at hyphenologist.co.uk Thu Apr 10 02:05:12 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Thu, 10 Apr 2008 10:05:12 +0100 Subject: [gutvol-d] very interesting great idea. perhaps patentable too. In-Reply-To: <47FD3CEE.6010708@bohol.ph> References: <47FD3CEE.6010708@bohol.ph> Message-ID: <003c01c89ae9$fcfac710$f6f05530$@co.uk> Jeroen Hellingman (Mailing List Account) wrote: Bowerbird at aol.com wrote: >> i can see we probably disagree on what constitutes "low-frequency". >> >> i think anytime you get the same string out of o.c.r. 4 times or more, >> you no longer have a word that is "low-frequency"... so, for dialect, >> many of the terms are going to occur more than 4 times in the book. >> so you don't need to look at those words. you'll look at the ones that don't... >> >> >I do works in dialects and minority languages, and face several issues: >- no official standard orthography (so people write as they like, with >lots of variation) >- limited knowledge of the language or dialect (or the rules of spelling >applied) >- too small a sample to build a reasonable corpus. You probably need about >half a billion words >of text, from all fields, to collect a complete set of words included in >a spelling dictionary. I also do works in dialect (Yorkshire), of which there are at minimum three versions (North, East, and West Ridings), and arguably each town had its own dialect, each with its own description in a 19th century book.
My mother in the 1920s could tell from which *valley* of Huddersfield a child came by its speech. There was/is absolutely no agreement on how to spell dialect words. Different authors use different spellings for different vocalisations of the same words. The spelling of words differs even in a single work. Over the lifetime of a single author the spellings he used differed substantially. For works in English/American, the languages (pre 1923, mostly 18??) were developing rapidly, and in different directions, so the orthography used in different works differs markedly. Any attempt, intended or otherwise, to impose modern English or American standards on old books would be a disaster and against the ethos of PG. Dave Fawthrop From Bowerbird at aol.com Thu Apr 10 10:51:34 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Apr 2008 13:51:34 EDT Subject: [gutvol-d] very interesting great idea. perhaps patentable too. Message-ID: dave said: > Any attempt, intended or otherwise, > to impose modern English or American standards > on old books would be a disaster and against the ethos of PG. well, now you're outside the arena of discussion of a tool and smack dab in the middle of a philosophical discussion. which is fine, but probably needs to be a separate thread... -bowerbird From Bowerbird at aol.com Thu Apr 10 11:10:34 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Apr 2008 14:10:34 EDT Subject: [gutvol-d] bbird's (?)
idea Message-ID: bastien said: > The only time I wanted to share something here > (some Emacs lisp code implementing a heuristic > to help users correctly format a PG text file), > I didn't take the time to speak about this because > I thought that you would flood this under your rants. grow a pair, bastien. do enough research on your idea that you attain _confidence_ in its value, enough so that you gain a sense of certainty that you can _defend_ it and assert it into the world. that way you can ensure that it won't be "flooded" under anyone's "rants"... > I won't go further on this. yeah, well, then, it probably doesn't merit much attention from anyone else. or so they will think. > I'll open a new thread when I have something to say. and i'll be waiting... -bowerbird From Bowerbird at aol.com Thu Apr 10 12:33:04 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Apr 2008 15:33:04 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 09 Message-ID: in "christopher and the clockmakers", used for the second parallel proofing experiment at d.p., p1 proofers changed 3,724 lines from the o.c.r.: > http://z-m-l.com/go/chris/chris-3724changes-p1.html it's really a bummer p1 had to make so many changes... especially since we determined _3,000_ of those changes were _totally_unnecessary_, having been caused by either bad action (scans, o.c.r. choice) by the content provider, or stupid d.p. policies requiring work that a computer can do better. i culled the changes that were due to those factors and was left with just 546 "real" changes -- things that abbyy (a good o.c.r. app) got wrong, and thus _needed_ fixing...
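A changed-line count like the 3,724 reported above can be computed with Python's difflib, comparing the two versions line by line; the two little line lists here are hypothetical stand-ins for the o.c.r. output and the p1 result:

```python
import difflib

def count_changed_lines(ocr_lines, proofed_lines):
    """Count the lines proofers touched, pairing up replaced blocks."""
    sm = difflib.SequenceMatcher(None, ocr_lines, proofed_lines)
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in sm.get_opcodes()
               if tag != "equal")

ocr = ["the cat sat", "on the rnat", "and slept"]
p1  = ["the cat sat", "on the mat", "and slept"]
print(count_changed_lines(ocr, p1))  # 1
```

The same matcher's opcodes can also drive a "diff" listing like the one on the linked page, since each non-equal block identifies exactly which o.c.r. lines were replaced and what they became.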
a first draft of "real" errors that needed "real" fixes is here: > http://z-m-l.com/go/chris/chris-abby-p1-546changes.html as we can see even on this draft, the number of "real" errors is even lower than 546, because quite a few of these changes do _not_ reflect o.c.r. errors on the part of abbyy. i examined and categorized the first 182 of these 546 -- a full one-third. (i've appended my categorization to this post, for anyone who wants to check it. the numbers listed are the error-number -- i.e., the leftmost number in the data on the above web-page.) i located 17 cases where the p1 line was incorrect, including 5 cases of "mr. button", where a p1 non-change was wrong... i also found 8 cases where my dehyphenation routine was too primitive, which would be eliminated when i improved it. finally, i found 33 cases where bad scans were likely to blame. (which makes my earlier estimate there were _100_ such lines look pretty darn good, since these 33 were in 1/3 of the file)... multiply these 58 cases by 3, for estimation on the full file, and you'd have 174 cases being eliminated from the 546, leaving us 372 "real" changes, which just happens to be 10% of the number of changes that were originally made. and of these 372 remaining cases, my estimate is that 75 could be detected by a clean-up tool looking at the aspect of sentence termination alone, since they involved mistakes on missing periods and misrecognized exclamation points. (those are listed below as well.) and of the 297 remaining errors, i'd guess that about _half_ could be detected by other clean-up routines, meaning that there were about 149 o.c.r. errors in this book that _required_ human attention to locate and correct. a far cry from 3,724... *** so once again, in this book, we have the exact same results we've had for the other books, namely that even though it might _appear_ that the o.c.r. 
needs thousands of changes, the vast majority of those -- between 80% and 90% -- are _avoidable_ and certainly _not_ the results of deficient o.c.r. they are just busywork, imposed by ineptitude or bad policy. when you cull the avoidable changes so that you are left with _only_ the o.c.r. errors, the number is _not_ in the thousands, but rather in the _hundreds_. if the content provider here hadn't been incompetent, the proofers would have had to correct a mere 372 errors, meaning that they would've spent (much) less time on this book, and moved it even closer to perfection. (not that they did such a terrible job as it was, considering they made all but 378 of 8,000 lines perfect, but imagine how much better they'd be if they hadn't been distracted by the 3,000 unnecessary changes they had to make.) the vast majority of the time and energy that is being donated to distributed proofreaders by the p1 proofers is being wasted. and i've proven this a number of times now, on several books, and d.p. is doing _nothing_ to change this abysmal situation... -bowerbird p.s. categorization of the first 182 of the 546 errors (1/3), indicating that these cases do not constitute "real" errors: the tesseract/p1 combo was wrong, not abbyy 52, 60, 66, 68, 75, 76, 150, 151, 152, 158, 164, 166 "mr. button" 57, 61, 67, 128, 142 my dehyphenation could be improved 45, 47, 115, 132, 133, 162, 177, 178 bad scan 70, 71, 78, 79, 80, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 99, 100, 103, 104, 105, 106, 107, 136, 148, 149, 170, 171, 172, 173, 174 the following are indeed real o.c.r. errors, but they are errors that could have been easily detected by a clean-up program: missing period 51, 54, 55, 59, 72, 126, 127, 130, 141, 155, 156, 159, 161, 163, 165, 167, 168 exclamation mark 43, 44, 58, 64, 65, 81, 123, 134, 143 ************** Planning your summer road trip? Check out AOL Travel Guides.
(http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080410/6715a26d/attachment.htm From jon.ingram at gmail.com Thu Apr 10 13:38:40 2008 From: jon.ingram at gmail.com (Jon Ingram) Date: Thu, 10 Apr 2008 21:38:40 +0100 Subject: [gutvol-d] parallel -- chris and the clockmakers -- 09 In-Reply-To: References: Message-ID: <4baf53720804101338o7de52605h6c5771c75fc31df6@mail.gmail.com> On Thu, Apr 10, 2008 at 8:33 PM, wrote: > ... the proofers would have > had to correct a mere 372 errors, meaning that they would've > spent (much) less time on this book, and moved it even closer > to perfection. I've found it very interesting to read your recent series of posts. Your arguments and conclusions contradict many of the conclusions I draw from my experience with DP, both as a proofreader and a content provider. One reason for this may be that you have a different view to me, and to many others, as to what the goals and aims of something like DP should be. From reading some of your posts, it seems that you assume, among other things,
a) that proofers will spend less time on a page which has few errors than on a page with more errors;
b) that it is a good thing for proofers to spend less time on a page;
c) that, given a page with few errors, a proofer will produce a page which is more perfect than they would do given a page which has more errors;
d) that proofers prefer to work on pages with few errors than on pages with more errors;
e) that proofers find all errors equally taxing to correct;
f) that these perpetual P1 books have been processed in a way representative of other books in DP;
g) that the proofing of these books is representative of proofing on other books;
h) that the people providing this material have not considered any of these issues already.
These combine to make you form the value judgement that > the vast majority of the time and energy that is being donated > to distributed proofreaders by the p1 proofers is being wasted. which is an interesting point of view, and one I'm sure is shared by several other people, not just about DP, but about other 'horde'-based collaborative systems. It may even be correct, but if so I believe it is a conclusion drawn from false premises, which you may want to re-examine. An alternative point of view, which contradicts several of the assumptions above: A significant number of people find it harder to work on projects which are 'near perfect' (say, less than one correction every couple of pages) than on ones which have at least one correction to do per page. I am one of these people. For people like this, improving the quality of a page beyond a certain level actually reduces our effectiveness as proofers. Instead of seeing every error corrected as a cost to be borne, then, this group of people may well see correcting an error as a reward, which provides enjoyment (although there is only a certain level of errors for which this is true!). So, rather than berating the person who produced this book for not auto-correcting end-of-line hyphens, I would regard it as a useful way for a proofer to get satisfaction out of improving each page, and for someone perusing the diffs to spot any people who were not 'making the grade', as everyone should know how to correct errors such as this. You may want to see if you can see proofing from this point of view, just as I have tried to see it from yours. I certainly think you should accept that your view on the costs and benefits of proofing is by no means the only valid one.
-- Jon Ingram From Bowerbird at aol.com Thu Apr 10 15:57:59 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Apr 2008 18:57:59 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 09 Message-ID: jon said: > I've found it very interesting to read your recent series of posts. fancy that. > Your arguments and conclusions contradict many > of the conclusions I draw from my experience with DP and that. > both as a proofreader and a content provider. and that as well... > One reason for this may be that you have a different view to me, > and to many others, as to what the goal and aims of something > like DP should be. i don't think that's the "reason" for our different "conclusions", since my _view_ -- to the extent that i even _have_ a "view" -- has little impact (if any at all) on the data that i've presented... but go ahead... > a) that proofers will spend less time on a page > which has few errors than on a page with more errors; no, i make no such assumption. _proofing_time_ isn't a variable i've looked at. when i say that d.p. is wasting the time of its volunteers, i mean the time spent making _unnecessary_ changes... when proofers have to make _7-9_ unnecessary changes for every _1_ necessary change, it takes them more time... it also draws their attention away from that 1 change, so the odds that they will miss it are increased, but since the p1 proofers have shown _outstanding_ accuracy anyway, i haven't made a big deal out of that part of it... however, i _do_ usually say "time and energy", and i think it's clear it takes a lot more _energy_ to make corrections. every glitch you have to find and stop and fix takes a toll in the psychic energy you're devoting to the proofing job. i'll continue to speak in terms of "time", because jon has, but be aware that this is a shorthand for "time and energy". > b) that it is a good thing for proofers to spend less time on a page; again, this is outside the scope of the data i've presented. 
it is a good thing for proofers _not_ to have to spend time making unnecessary changes, because that time is wasted. do you disagree with me about that? and the less time proofers _waste_ on a page, the better... yes, i can definitely agree with _that_. > c) that, given a page with few errors, a proofer will > produce a page which is more perfect than they would do > given a page which has more errors; again, you don't seem to have understood the data i presented. that data does not presume to answer the question you've posed. some proofers report that a "clean" page lulls them to sleep, and i have no reason to doubt those proofers. some proofers disagree. you'd need to craft an experiment to tease out the facts of the case. i'd tend to believe both sides, and chalk it up to individual differences. nonetheless, i don't think anyone will seriously make the case that rejoining a thousand end-of-line hyphenates "keeps them alert"... (and even if they did, i doubt that the research would support them; i'd think they're confusing the busywork aspects with real progress.) and i seriously question anyone who would make the argument that we should use inferior o.c.r. programs, so they make more scannos, so the proofers will "stay awake". i mean, that just borders on _silly_. likewise, i don't think anyone would defend the creation of bad scans so as to jack up the error rate in order to keep everyone "more alert". since if that was the case, we'd use programs to "dirty up" the scans, instead of the opposite -- using programs to "clean up" the scans... but yes, jon, i do actually believe that proofers will do a better job of finding and fixing the 372 "real" errors in a book if they are relieved of the busywork of making an additional 3,000 unnecessary changes. don't you? because that's the whole point of progressive rounds, isn't it? to whittle down the number of errors faced in the later rounds? 
besides, if what you say is true, very few people would want to ever proof in p2 or p3. and d.p. projects would be backlogged. (this is called _irony_, folks... please look it up before replying.) > d) that proofers prefer to work on pages > with few errors than on pages with more errors; individual differences. some do. and some don't. to each his own. my workflow labels each type of proofing clearly, so people can pick. but i do believe we can all agree that, at the end of the workflow, we want books to come out with as few errors as possible. right? so _someone_, _somewhere_, _sometime_, is gonna have to work with a page that is fairly clean, maybe really clean, maybe perfect. > e) that proofers find all errors equally taxing to correct; oh please. what a patently ridiculous thing to say. and you want to put that garbage in _my_ mouth? and you expect me to take this argument seriously? exercise some common sense, or just stay silent... > f) that these perpetual P1 books have been processed > in a way representative of other books in DP; well, the normal rounds they went through originally were not just "representative", they were _actual_products_of_the_actual_process_. since then, however, "planet strappers" has been absurdly atypical. and that's the only "perpetual" experiment. so, wrong again, jon... > g) that the proofing of these books is representative of > proofing on other books; it's representative of similar books, i think we can say that, yes... if you wanna prove it's _not_representative_, then present your data. first describe the different "representations" that books can take, and then sort the d.p. projects into these categories, and then give me a list of a dozen books in each category, and i will pick the ones i want to look at, and i'll show you data that is _exactly_ like the data that i've presented thus far on the books d.p. chose.
and i'm surprised you'd try this "representative" dodge again, jon, when it blew up in your face the time you tried it on bookpeople... > h) that the people providing this material have > not considered any of these issues already. hard to know what anyone has "already considered", since there hasn't been much discussion anywhere... and the little discussion there _has_ been has been of a _severely-depressing_ low level of awareness... > These combine to make you form the value judgement that > > the vast majority of the time and energy that is being donated > > to distributed proofreaders by the p1 proofers is being wasted. that's right. and more specifically, it's being wasted by the _injection_ of errors by content providers, through (1) creation of inferior scans (2) the choice of inferior o.c.r. programs, (3) mishandling of the text (such as planet strappers' global change of em-dashes to en-dashes), and (4) grossly inefficient use of preprocessing clean-up programs... and i've documented -- in several different books now, all of which were _chosen_ by d.p. people, for specific use in doing experiments to gain the very type of data that i've been presenting -- that these _unnecessary_changes_ that are being _forced_ on the p1 proofers number in the _thousands_, while the actual, honest-to-goodness o.c.r. errors number in the _hundreds_ under a worthwhile workflow. and i'm saying the time and energy spent making these _unnecessary_ changes is time and energy that's being _wasted_, and that's _immoral_. i've presented solid evidence that proves the unnecessary changes, so i know you're not arguing with that. so i guess you're arguing that the actions and policies that _necessitate_ these unnecessary changes are not immoral. so ok, then, let's see your evidence for that position. because i see the flagrant waste of all that time as being immoral... nothing approaching the scale of the immorality of the war, for instance, to keep things in perspective. 
but still immoral enough to be troubling... > A significant number of people of people find it > harder to work on projects which are 'near perfect' > (say, less than one correction every couple of pages) > than on ones which have at least one correction to do per page. like i said, i've heard this. so what? it certainly doesn't become an argument for _injecting_errors_ into the text. not unless you do it in a _controlled_way_ which allows for their later removal. no jon, what i'm talking about is a whole lot simpler than that. it's a person at the beginning of the workflow doing crappy work that causes people down the line to have to work harder than they otherwise would have had to work, harder than they _should_ have. i'm talking about making crappy scans that cause crappy o.c.r. there's no "excuse" for that. and when you try to _make_ one, you just make yourself look bad. i'm talking about choosing a beta-level program to do the o.c.r., one that is so inferior that it makes _a_thousand_mistakes_more_ than the acknowledged leading o.c.r. program. no excuse for that. these are bad decisions that cause unnecessary work, and because of that simple fact, they are bad decisions that should be _eliminated_... and, on these books, they should have been overruled immediately, instead of letting text go to proofers who had to clean up the mess. "crappy scans will no longer be allowed." is it so hard to say that? no. is there any reason not to disallow crappy scans? not these days, no, since google has over a million scan-sets just waiting to be digitized... "o.c.r. must be done with a program that provides us decent results." any reason not to say that? not a one. it's the intelligent thing to say. "content providers may not make dumb mistakes, like global change of all em-dashes to en-dashes, and must fix it themselves if they do." again, is that hard to say? and is there a good reason _not_ to say it? 
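and just so nobody thinks the preprocessing i keep talking about is rocket science, here's a sketch of automatic end-of-line hyphenate rejoining. (hypothetical code, not any actual d.p. tool; a real pass would check each joined word against a dictionary before committing, since some end-of-line hyphens belong to genuine compounds.)

```python
def rejoin_hyphenates(lines):
    """Rejoin words split across lines with a trailing hyphen.

    Naive sketch: treats every end-of-line hyphen as a soft break,
    which is wrong for genuinely hyphenated compounds -- a real tool
    would check the joined word against a dictionary first.
    """
    out = []
    for line in lines:
        if out and out[-1].endswith("-"):
            head = out.pop()
            # move the first word of this line up to complete the split word
            first, _, rest = line.lstrip().partition(" ")
            out.append(head[:-1] + first)
            if rest:
                out.append(rest)
        else:
            out.append(line)
    return out
```

a search/replace pass like this runs over a whole book in a fraction of a second, which is the point: it's work a machine should be doing, not a proofer.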
> Instead of seeing every error corrected as a cost to be borne, then, > this group of people may well see correcting an error as a reward, > which provides enjoyment and we call those people masochists... :+) seriously, there are _millions_ of books that need to be digitized. there is _no_shortage_ of errors that can be corrected as rewards. so there is _no_need_ to "create more" with crappy scans and o.c.r. but tell me, jon, since you confess that you are one of these people, do you get a sense of accomplishment out of rejoining hyphenates? serious question. because if you do, i can see a way that d.p. could turn on/off the automatic rejoining of hyphenates via user preference. i mean, if rejoining keeps you alert, and gives you a sense of satisfaction, i'm certainly not advocating that d.p. should _take_that_away_ from you... but to force it on people who don't want it? that's unconscionable... > (although there is only a certain level of errors for which this is true!). i suspected as much. and i bet it's far under 3,742 errors in a book... > So, rather than berating the person who produced this book > for not auto-correcting end-of-line hyphens hold it. let's keep things straight, ok? (i know, that's kind of ironic, given that you haven't gotten _anything_ straight that i have said...) i blame d.p. "leadership" for failing to implement a workflow where end-of-line hyphens are rejoined automatically. that's d.p. policy... i don't blame the content provider for failing to rejoin hyphenates... (although, it is necessary to point out now, if _this_ content producer had done the kind of preprocessing which juliet and a good number of other content providers do for their books, the end-of-line hyphenates _would_ have been rejoined during that preprocessing. i'm just sayin'...) but d.p. policy should require a lot of preprocessing, and d.p. 
should make it a high priority to thoroughly update the preprocessing tools, in order to facilitate that required preprocessing, and make it standard. > I would regard it as a useful way for a proofer > to get satisfaction out of improving each page ok, i guess you've answered my question about the "reward" of that... frankly, if you feel a sense of accomplishment from doing something that a computer can do 100 times faster than you, i'm happy for you... have you lobbied juliet to stop doing the automatic rejoining? maybe you could put up a poll to see how many other proofers would join in. (boy, would i ever love to see the wording on _that_ poll!) > and for someone perusing the diffs to spot any people who > were not 'making the grade', as everyone should know > how to correct errors such as this. well, if you want to give people a kindergarten test, give them one, but don't make your proven proofers do kindergarten tests on every page. > You may want to see if you can see proofing from this point of view, > just as I have tried to see it from yours. well, first let's note that you failed abjectly at seeing my point of view... i disagreed strongly on _all_ your assumptions about what i'm thinking. and frankly, i'm not even sure i can see _your_ point of view... are you really telling me that crappy scans and crappy o.c.r. are _ok_, content provider mistakes that inject literally _thousands_ of errors into a book are _ok_, an absence of preprocessing that would relieve p1 proofers of _thousands_ of changes, and d.p. policies that require _unnecessary_changes_ to be made by the _thousands_, that all this is _ok_ because "proofers like to make lots of changes on the page"... are you _really_ telling me that, jon? :+) because if that's what you're _really_ telling me, then you betcha, i will try to wrap my brain around that strange notion, i really will... but _surely_ that's not what you're _really_ telling me, jon, is it?
that p1 proofers _want_ you to make a mess out of the text, so when you send it to them they can make lots of changes? really? > You may want to see if you can see proofing from this point of view, > just as I have tried to see it from yours. I certainly think you should > accept that your view on the costs and benefits of proofing is > by no means the only valid one. well, jon, providing i hear nothing from the d.p. "leadership" in a week, i'll put together a solid plan to take my research findings to the public, and we'll see whether _they_ feel d.p. wastes the time of its volunteers. what you or i think? doesn't matter. what tens of thousands of people think? starts to matter, a little bit... but if you're thinking sloppy scans and beta-level o.c.r. programs and outright global-change mistakes by content providers will be "forgiven", let alone "cheered as the right thing", i'd say you've bet the wrong way... *** so all in all, jon, i think you're wrong. you tried to make a post on "bowerbird doesn't understand the nature of collaborative projects". and that's b.s., jon. because i _understand_ that if your collaborative project wastes the energy of its volunteers, and the general public becomes aware of this fact via solid evidence, your project is going to have a hard time attracting future volunteers... which, i guess, you won't mind. it will leave more corrections for you... -bowerbird ************** Planning your summer road trip? Check out AOL Travel Guides. (http://travel.aol.com/travel-guide/united-states?ncid=aoltrv00030000000016) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080410/1633d404/attachment-0001.htm From Bowerbird at aol.com Thu Apr 10 16:24:45 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Apr 2008 19:24:45 EDT Subject: [gutvol-d] turning proofing into a game Message-ID: there is a flip side to jon ingram's position in his recent post. since he didn't make the argument for that flip side very well, i'll do it for him. in the process, i'll show that not only am i not _unaware_ of the mental processes that jon mentions, but i have also done some thinking on how best to _utilize_ that orientation... as i hinted in my response to jon, you might want to _deliberately_ insert errors into a page, as a means of testing the alertness the proofer brought to the task... if they catch all of the errors that you deliberately inserted, you can assume that they were exercising good alertness... so if they found no further errors, you could be "more" assured that there were indeed no other errors there... (how much more assured you'd be is another question.) however, pretty much the only way you could make the _insertion_ of errors into the text palatable to proofers is to construe the job in a completely different manner, specifically transforming it from a _task_ into a _game_, where the _object_ of the game is to "catch the scannos". people could specify how many scannos they wanted on each page, say from 1-10, and you could guarantee that they had a number in that range, plus any unknown ones. if you're gonna turn it into a game, you need competition, which means other people would "play" the same page, and the most accurate person would win -- or, where they were equally accurate, the fastest one would win...
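the mechanics described above -- plant a known number of scannos, then score players on how many they catch -- fit in a few lines. a sketch in python (hypothetical names, nothing to do with any actual d.p. code):

```python
import random

def plant_scannos(words, substitutions, n, seed=0):
    """Plant n known scannos into a page (a list of words).

    Returns the corrupted page plus the positions of the planted
    errors, so a player's corrections can be scored later.
    `substitutions` maps a word to a plausible scanno for it.
    """
    rng = random.Random(seed)
    candidates = [i for i, w in enumerate(words) if w in substitutions]
    planted = rng.sample(candidates, min(n, len(candidates)))
    page = list(words)
    for i in planted:
        page[i] = substitutions[page[i]]
    return page, set(planted)

def score(planted, positions_fixed):
    """A player gets full marks only by catching every planted scanno."""
    caught = planted & positions_fixed
    return len(caught), len(planted)
```

the key property: since the planted positions are recorded, the injected errors are guaranteed removable afterward -- which is the "controlled way" i keep insisting on.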
i don't have the web-programming skills to build a game like this -- you'd have to have lotsa _real-time_ chops -- and i'm not altogether convinced it will work as a "game", but i once posted a reference to luis von ahn to this list precisely as a means of stimulating someone to do that. -bowerbird From Bowerbird at aol.com Thu Apr 10 19:55:10 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Apr 2008 22:55:10 EDT Subject: [gutvol-d] here's a quick screenshot Message-ID: i always find it so humorous to have people "recommend" to me that i code up my ideas. um... all that research i report to you folks? well... breaking it to you as gently as i can... i do it with _tools_ i've coded from my ideas. here's a screenshot from one of them: > http://z-m-l.com/misc/round-by-round37.png i have a 23-inch monitor, and i use _all_ of it, so you should look at that picture at full-size to get an idea of the scope. big screens rock. this tool shows progression of the text in d.p. on top, at the left is the o.c.r. output. next to it is the p1, then the p2 and (obscured) p3 output. the lines in the o.c.r. window are colorized green if they were perfect, or red if proofers made a change. in the bottom row, the leftmost field shows changes made to the o.c.r. text by the p1 proofer, with the top line colorized red from the o.c.r., and the bottom line -- in black -- being the line as it was changed by p1... the center box shows changes that were made by p2; red lines were from p1, with black being p2 output... and the rightmost field shows changes made by p3; for the record, p3 made zero changes to this page...
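(for the curious: the green/red colorizing is nothing fancy -- just line-aligned comparison across rounds. a toy sketch of that classification, assuming the four versions line up one-to-one, which the real tool has to guarantee:)

```python
def classify_lines(ocr, p1, p2, p3):
    """For each o.c.r. line, report which rounds changed it.

    A line is 'perfect' if no round touched it; otherwise we record
    the rounds that did (e.g. 'p1' or 'p1+p2'). Assumes all four
    versions are line-aligned lists of strings.
    """
    report = []
    for a, b, c, d in zip(ocr, p1, p2, p3):
        changed = [name for name, prev, cur in
                   [("p1", a, b), ("p2", b, c), ("p3", c, d)]
                   if prev != cur]
        report.append("perfect" if not changed else "+".join(changed))
    return report
```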
also, as you can see, the page-scan can be summoned in an overlay window that can be moved where i want... navigating the pages quickly, this tool gives me a "feel" for how many changes each round made on each one... plus i can get down and see specific changes if i desire. the tool does a lot more than that, too, but this is just a quick screenshot... :+) -bowerbird p.s. here's a screenshot from a page with a crappy scan: > http://z-m-l.com/misc/round-by-round36.png From jon.ingram at gmail.com Fri Apr 11 01:43:26 2008 From: jon.ingram at gmail.com (Jon Ingram) Date: Fri, 11 Apr 2008 09:43:26 +0100 Subject: [gutvol-d] turning proofing into a game In-Reply-To: References: Message-ID: <4baf53720804110143q4ab7b9c4m406393d24470dba8@mail.gmail.com> On Fri, Apr 11, 2008 at 12:24 AM, wrote: > there is a flip side to jon ingram's position in his recent post. > > since he didn't make the argument for that flip side very well, > i'll do it for him. Thank you for this, and the previous post. I wasn't aware I was making an argument for any position, but to the extent I did, it could certainly have been expressed better. I am aware, for example, that several people (including yourself) have suggested inserting deliberate errors into texts -- I saw several threads about this and similar topics on the DP message boards back when I monitored them closely. I could easily start a back-and-forth argument with you on this list, but I am sure I am not alone in finding mailing list threads which involve line-by-line dissections of previous posts tiring to read and unproductive. I have instead printed out your response to my email, and will read it over lunch.
Apologies if this means you don't get the instant feedback you want, but as a mathematician, writing more than 2000 words a year is unusual for me :). -- Jon Ingram From Bowerbird at aol.com Fri Apr 11 09:32:15 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Apr 2008 12:32:15 EDT Subject: [gutvol-d] turning proofing into a game Message-ID: jon said: > Thank you for this, and the previous post. you're thanking me for saying your post had little substance? am i supposed to say "you're welcome" now? > I wasn't aware I was making an argument for any position well, you pointed out the importance of fostering a sense of efficacy in the proofers, creating a feeling of accomplishment. which is a good point. except i didn't need a reminder of that. indeed, to the contrary, i believe that such motivation supports _my_ position, not yours. how can a person walk away with a sense of agency when they are doing a rote mechanical task that the computer does better? i don't believe rejoining hyphenates seems all that "fulfilling" to the vast majority of proofers. i felt _terribly_inefficient_ locked into the one-page-at-a-time d.p. interface, making the same change on page after page and doing it manually, when i _knew_ that i could make that change _automatically_, across the entire _book_, with a search/replace, in a matter of seconds, without having to do any looking for cases and yet still having the confidence that i had corrected all of them. faster, easier, and with greater confidence. now _that_ is efficacy! i firmly believe d.p. people will get a bigger sense of accomplishment by digitizing more books, not by making more meaningless changes. > I am aware, for example, that several people (including yourself) > have suggested inserting deliberate errors into texts oops! hold it... i have never _recommended_ the deliberate insertion of errors. 
i've entertained the suggestion, but i've never recommended it, because it's too much like the busywork i'd like d.p. to eliminate. further, the only way i've even entertained the suggestion to do it is for those specific people who actually _request_ that it be done. and even then, i'd probably want to see some research results that indicate that it really does serve the purpose of keeping them alert. plus, of course, turning it into a "game", but that's a different thing. > I could easily start a back-and-forth argument with you on this list, you could _start_ one, but i don't think you could _win_ one... :+) > but I am sure I am not alone in finding mailing list threads > which involve line-by-line dissections of previous posts > tiring to read and unproductive. well, if the points are meaty, i find dialog refreshing and productive... but yeah, if they're just filled with hot air, they serve no good purpose. > I have instead printed out your response to my email, > and will read it over lunch. i probably can't recommend it as an aid to digestion, or for relaxation. so i would likely suggest enjoying some nice sunshine instead... :+) on the other hand, it would mean you had something meaty for lunch. > Apologies if this means you don't get the instant feedback you want no problem on this end. but when you have some meat, bring it back... -bowerbird From Bowerbird at aol.com Fri Apr 11 10:17:43 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Apr 2008 13:17:43 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 10 Message-ID: since we're up to #10 in this series, time for a review...
this is on "christopher and the clockmakers", used for the second parallel proofing experiment at d.p. again, we've found the pattern we've now come to expect, the pattern that seems to capture a "common-sense" take, which is that p1 fixes most of the errors, p2 gets most of the remaining ones, and p3 comes in and does clean-up. again, this is the pattern you get on page after page, in book after book, day after day, over in d.p.-land... (yes, i just keep copy/pasting those same old passages.) *** in sum, in this book, p1 changed 3,724 lines from the o.c.r. examination of these lines proved that _over_3,000_ were _unnecessary_changes_, due to bad actions or bad policy, where that means bad scans, poor choice of o.c.r. program, and insufficient (or was it even nonexistent?) preprocessing. depending on the aggressiveness of the clean-up done on it, the o.c.r. for this book should've had a couple hundred errors. perhaps it's not surprising, then -- since the p1 proofers were essentially performing "human o.c.r." on the text -- we found only 378 lines were changed during the p2 and p3 rounds... plus clean-up could have fixed half _those_ lines automatically, just for the record, but we'll have to live with what we got here... however, since p1 changed 3,724 lines, and later rounds only changed a mere 378 lines, p1 did a great job, close to perfect... *** in looking ahead, we find ourselves at a very interesting point... unlike the previous experiments in this vein analyzed so far, this book went through the "normal" p1->p2->p3 workflow and then it went through _another_ "normal" p1->p2->p3... for ease of notation, i'll label the "repeat" flow as r1->r2->r3. we've never had p2 repeated before, or p3, so that's new and could prove to be a very interesting and exciting twist on this. so the next thing we'll do is analyze the r1->r2->r3 repeat... we want to see if it follows the same pattern as the first one. we'll compare the _number_ of lines changed, in each round.
then we'll see how many of the changes were _in_common_ between the p1->p2->p3 flow and the r1->r2->r3 flow, and how many of 'em were _unique_ to the separate flows. if the final output from both flows is highly similar or identical, we'll know that the "normal" workflow is getting the job done... to the degree that they differ, we will know a normal workflow is _not_ getting the job done, and perhaps we'll find out why... -bowerbird p.s. rather than let you hang _completely_ over the weekend, i will reveal now that the _number_ of lines changed in r1 was quite similar -- again, eerily! -- to the number changed in p1... specifically, p1 changed 3,742 lines, and r1 changed 3,677 lines. i've said it before, and i'll say it again: the p1 proofers really rock. From gbnewby at pglaf.org Sat Apr 12 16:52:51 2008 From: gbnewby at pglaf.org (Greg Newby) Date: Sat, 12 Apr 2008 16:52:51 -0700 Subject: [gutvol-d] Creative Commons license sample letter In-Reply-To: <1203088709.17572.51.camel@abetarda.mshome.net> References: <1203088709.17572.51.camel@abetarda.mshome.net> Message-ID: <20080412235251.GA6947@mail.pglaf.org> On Fri, Feb 15, 2008 at 03:18:29PM +0000, Júlio Reis wrote: > Hi all > > I have negotiated with the Bible Society of Portugal the publication of > a 2001 text of the Bible in Project Gutenberg. I feel this would be a > very good addition since there isn't any Portuguese Bible there. They > have gladly allowed it, and in fact I already have the XML file in my > possession. They want to release it under the Creative Commons > Attribution Non-commercial license. Sorry for not seeing this note earlier, Júlio.
Please, send copyright stuff to copyright at pglaf.org (it goes to me). Your sample letter is fine. But our procedure is to "wrap" whatever is provided in the PG "small print," header, etc. So, if they want to include a copy of the CC license, they (you) should do it in the file(s) provided to PG. We have a number of such eBooks, and my opinion is that the CC license is compatible with the PG small print trademark license. -- Greg > I have a question regarding the sample letter. The example on the web > site is for releasing in the public domain only, right? So I tried > mixing that with the CC restriction, and I want to know how the example > below reads legally. The SBP is the sole proprietor of the rights, and > there are no authors other than the original translator, deceased in > the 18th century. Parts within [ ] brackets are bits I have to complete > or decide upon yet. > > * * * > > To: Michael Hart, etc. > > Lisbon, 15 February 2008 > > Dear Sir, > > We are the sole copyright holders for the book, "Bíblia Sagrada > [complete title]." It gives us pleasure to grant Project Gutenberg > perpetual, worldwide, non-exclusive rights to distribute this book in > electronic form through Project Gutenberg Web sites, CDs or other > current and future formats. No royalties are due for these rights. > > [Use of such files is|End users of such web sites should use these > files] subject to the terms of the license Creative Commons Attribution > Non-commercial 2.5 Portugal. The full text of the license can be found > at the Internet address: > http://creativecommons.org/licenses/by-nc/2.5/pt/ > > Sincerely, > > For the Board of the Sociedade Bíblica de Portugal, > Timóteo Armelim Cavaco, Secretary-general > > * * * > > So, is that letter all right? Do we safeguard whatever it is we need to > safeguard for Gutenberg? And also, do we impose the CC licensing on > every user of PG?
> > Bonus points for giving me pointers into some software to massage the > XML, preferably under Linux, or under XP. > > Thanks > > Júlio aka Tintazul. > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From Bowerbird at aol.com Mon Apr 14 01:01:07 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 14 Apr 2008 04:01:07 EDT Subject: [gutvol-d] before we jump back in Message-ID: before we jump back in to data for "christopher and the clockmakers", let's look in on the "perpetual" test. iteration#7 is continuing on with the running joke in _fine_ fashion, with a _huge_ number of changes. there are the usual ellipse changes, and the standard dance with people inserting and/or removing asterisks on end-of-line hyphenates, and your blank lines being added or subtracted. ho hum, the normal unnecessary crap, bureaucratic changes running amok... in addition, however, _new_ wrinkles! here's a new proofer -- or perhaps a long-gone proofer newly returned -- which we notice easily because they're inserting _italics_markup_ in the text... that is considered "formatting", which is strictly verboten in "proofing" rounds. in addition, they seem to have a _very_ badly skewed understanding of the rules regarding dashes, turning the en-dashes in many compound words to em-dashes. (but, strangely, not all of them. weirdish.) well, like i said, the running joke goes on... but what? wait! what is this? on file#53, we find a p-book typo being noted. wow. it is a known error -- found first by p2 -- but i think that's the first time that one has been noted by the i-iterations. well-done! wait, there's more! _another_ p-book error -- an unbalanced quotemark on file#78 -- and this one is a _scoop_. never seen before! imagine that, folks...
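(an unbalanced quotemark, by the way, is exactly the kind of error a trivial clean-up pass can flag. a toy per-paragraph check -- just a sketch; real dialogue conventions, like a quote continued across paragraphs that opens without closing, need smarter handling:)

```python
def unbalanced_quote_paragraphs(paragraphs):
    """Flag paragraphs whose double quotemarks don't pair up.

    Rough heuristic: straight double quotes should occur an even
    number of times in a paragraph. This is only a first-cut screen;
    legitimate continued quotations will produce false positives.
    """
    return [i for i, p in enumerate(paragraphs) if p.count('"') % 2 == 1]
```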
this text went through the normal p1->p2->p3 workflow, and then it's gone through _6_ additional iterations of perpetual p1, and nobody caught that before! good eye! congratulations! it took _9_ rounds to find and fix this error! wow! knowing that can happen, how can we ever have one bit of "confidence" that any page is perfect? -bowerbird p.s. not that we're ready to bring "formatting" into the discussion yet, but -- for the record -- i'll note that two rounds of formatters _also_ missed this p-book error. very remarkable... p.p.s. not to spoil that person's _well-deserved_ sense of satisfaction at catching a sneaky error that evaded 10 pairs of eyeballs before them, but an unbalanced quotemark is easy to detect using a clean-up tool. in fact, there _might_ be _more_ of 'em in this text. have your tool check: > http://z-m-l.com/go/plans/plans.zml From Bowerbird at aol.com Mon Apr 14 09:35:17 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 14 Apr 2008 12:35:17 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 11 Message-ID: i hope you enjoyed your weekend... more data from "christopher and the clockmakers", from the second parallel proofing experiment at d.p. this book is interesting because it went through a "normal" p1->p2->p3 workflow, then _repeated_ a separate run through the "normal" p1->p2->p3, which i am labeling here as r1->r2->r3 for clarity... since both of these workflows were "normal" ones, i will call the two separate projects different "flows". *** so, our first question will be, "were the flows _similar_?" here were the lines changed during the p1->p2->p3 flow: > p1 changed 3,724 o.c.r. lines, mostly bureaucratic.
> p2 corrected real o.c.r. errors in 128 lines from p1. > p3 corrected real o.c.r. errors in 25 lines from p2... the actual changed line-pairs: > http://z-m-l.com/go/chris/chris-p1-3724changes.html > http://z-m-l.com/go/chris/chris-p2-0128changes.html > http://z-m-l.com/go/chris/chris-p3-0025changes.html here were the lines changed during the r1->r2->r3 flow: > r1 changed 3,677 o.c.r. lines, mostly bureaucratic. > r2 corrected real o.c.r. errors in 146 lines from r1. > r3 corrected real o.c.r. errors in 23 lines from r2... the actual changed line-pairs: > http://z-m-l.com/go/chris/chris-r1-3677changes.html > http://z-m-l.com/go/chris/chris-r2-0146changes.html > http://z-m-l.com/go/chris/chris-r3-0023changes.html as you can easily tell, those numbers are _extremely_ similar. both go from thousands of changes, to hundreds, to dozens. so once again here, we've observed the expected pattern, the pattern that seems to capture a "common-sense" take, which is that p1 fixes most of the errors, p2 gets most of the remaining ones, and p3 comes in and does clean-up. again, this is the pattern you get on page after page, in book after book, day after day, over in d.p.-land... so it's not surprising that we'd get it again... p1 rocks... *** here too, we get eerie similarities with the numbers... in checking, though, i found that there is little overlap in the _errors_ that came out of the p1 and r1 rounds. there is a huge number of identical lines coming out, but that was due to lines rendered perfectly by both... i find only about 2 dozen _erroneous_ line-pairs that came out of p1 and r1 identical, and all of them were _untouched_. that is, they came out of o.c.r. wrong... and i found only 1 pair of identical _erroneous_ lines coming out of the p2 and r2 rounds, and -- like the p1/r1 line-pairs -- it was untouched.
considering that there were well over 3,800 lines that were changed in the p1 and r1 rounds, this _tiny_ overlap in their identical proofing errors is truly insignificant.

***

and now we get into the interesting data, the data that we've never had before, where p2 and p3 are repeated. their numbers were very similar, as we observed above. what about analyzing the actual quality of their output?

first, let's compare the final output from both flows... the p3 output and the r3 output differed on 21 lines:
> http://z-m-l.com/go/chris/chris-21flowdiffs.html

although both of the workflows did acceptably well here -- easily passing my standard for continuous proofing -- _neither_ of these outputs reflects _perfectly_ on its flow, and thus is unlikely to make the obsessives at d.p. happy...

p1->p2->p3 left 8 o.c.r. errors, and missed 2 p-book errors.
r1->r2->r3 left 13 o.c.r. mistakes, but found 2 p-book errors.

the actual data:
> http://z-m-l.com/go/chris/chris-p-flow-misses8.html
> http://z-m-l.com/go/chris/chris-r-flow-misses13.html

both flows might have left more errors, errors in common, which -- because they're matching -- don't show up here... it's unlikely that there are many of those, but we don't know. but we _do_know_ they have the number of errors i listed, because those are the lines that _differ_ between the two... _one_ of them has to be wrong. (and it's fairly easy to tell.)

this is disappointing... the results of 3 rounds of proofing don't seem to be sufficient to create an error-free e-book. at least not when the 3 rounds are the normal p1->p2->p3. or the equally normal -- just relabeled here -- r1->r2->r3. ouch... or perhaps i should say "double-ouch"...

***

but the message isn't completely bleak, by any means. 8-13 errors after proofing is _not_ a cause for despair. especially when most of them can be seen to be easily detected by even a relatively feeble clean-up program...
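[editor's illustration] to make "a relatively feeble clean-up program" concrete, here is a minimal sketch of the kind of mechanical checks such a tool might run. the function name and the specific patterns are illustrative assumptions, not the behavior of any actual d.p. or p.g. tool -- but they do cover the unbalanced-quotemark and spaced-ellipsis cases discussed in these posts:

```python
import re

def cleanup_flags(text):
    """Run a few mechanical checks over a text, paragraph by paragraph,
    and report likely o.c.r./proofing slips: an odd number of double
    quotes in a paragraph, spaced-out ellipsis dots, and punctuation
    floating away from the word it belongs to.
    (Hypothetical sketch -- not any real clean-up tool.)"""
    flags = []
    for i, para in enumerate(text.split("\n\n")):
        if para.count('"') % 2 != 0:
            flags.append((i, "unbalanced double quote"))
        if re.search(r"\. \. \.", para):
            flags.append((i, "spaced ellipsis"))
        if re.search(r"\s[,;:]\w", para):
            flags.append((i, "floating punctuation"))
    return flags
```

a real clean-up pass would add many more patterns (unmatched brackets, stranded end-line hyphens, doubled words), but even three checks this crude would have flagged the sneaky quotemark error described above.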
and some of these are clearly due to the use of tesseract, and others are likely also caused by that, so that's a relief, since it means they could've been avoided by using abbyy. (and one message should be clear to all: tesseract sucks!)

moreover, as many of the undetectable problems involved misrecognized "punctuation" at the end of a line, we might find that the "noise" on the scans was responsible for them, in which case a better job of making the scans would help...

i am also buoyed by the thought that continuous proofing by the general public will eventually find and fix all errors... but if you've pinned your hopes for perfection on proofers, especially the p2 and p3 proofers, they let you down here...

let me repeat, though, that i think it's folly to expect that human proofers will deliver perfection. the human mind serves us so well precisely _because_ it can gloss over the peculiar anomalies that populate our natural environment. 8-13 proofing errors in a book of this size is phenomenal. i know people want to do better. the question: at what cost?

***

plus, as i've reconciled these differences, we have created a darn good book. the flows proved to be independent, so we essentially have two different parallel digitizations; they largely agree, and act as a check against each other... most errors that one flow missed were found by the other.

to sum it up, our 6 rounds of proofing created a great book. i don't know if it's _perfect_, but even if it isn't, it's darn close. (well, actually, i know it's _not_ perfect, since i found an error, but it was an error in the p-book, so we don't have to count it. still, i would be surprised if there's over 6 errors in it now, and i'd be shocked if there's more than 12. but you never know...)

but, um, we _really_ don't wanna have to do 6 proofing rounds. _especially_ if that means 2 rounds of p2 and 2 rounds of p3... (and even with all that, we _still_ couldn't know that it's perfect.)
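[editor's illustration] the cross-check logic being used here -- two independent flows, where every line on which they disagree pinpoints at least one error -- can be sketched in a few lines. the helper name is hypothetical, and it assumes the two outputs are aligned line-for-line:

```python
def flow_disagreements(flow_p, flow_r):
    """Compare two independently proofed transcriptions, given as
    line-for-line aligned lists of strings. Wherever the two disagree,
    at least one must be wrong, so each disagreement marks a line
    needing adjudication against the page scan.
    (Hypothetical sketch of the cross-check described in the post.)"""
    return [(i, p, r)
            for i, (p, r) in enumerate(zip(flow_p, flow_r))
            if p != r]
```

note the caveat already made above: lines where both flows made the _same_ mistake never show up in this list, which is why the 21 differing lines are only a lower bound on the remaining errors.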
what was done on this book is wholly unrealistic as a method. in sum, it's not the number of errors left that is the problem... it's the fact that, to reduce the number to that, it took 6 rounds. so, what can we do? well, let's look at things a bit more closely.

***

remember how, back with planet strappers, two rounds of p1 proved to be as effective as the normal flow of p1->p2->p3? i wanted to see if that might be the case here. so, specifically, for the errors missed by each of the workflows, i wanted to see which round of proofers had caught them in the opposing flow. if the x1 proofers in the other flow caught them, that'd be rad...

of the 8 errors that the p1->p2->p3 workflow missed:
> 1 was caught by r1
> 0 were caught by r2
> 7 were caught by r3
> http://z-m-l.com/go/chris/chris-p-flow-misses8.html

of the 13 errors that the r1->r2->r3 workflow missed:
> 4 were caught by p1
> 1 was caught by p2
> 8 were caught by p3
> http://z-m-l.com/go/chris/chris-r-flow-misses13.html

so, of 21 errors, 5 were caught by x1 proofers in the other flow, and 1 of those 21 was caught by an x2 proofer in the other flow... for x1 to catch 5 of the 21 is pretty darn good, but it ain't perfect. so my "quick fix" does not seem to have worked here...

3/4 of the errors (15 of 21) persisted until x3 in the other flow... that means that there were a grand total of _15_ "sneaky" errors. that's my technical term for errors that were _extremely_difficult_ to catch. the 15 sneaky errors were missed by 5 rounds out of 6, including 2 rounds of x1, 2 rounds of x2, and even 1 round of x3. it was only the _other_ round of x3 that finally found and fixed 'em.

there is some good news here, at least for some of us d.p. critics. in spite of their "sneaky" nature, 10 of these 15 errors are _easily_ detected by most clean-up tools, using well-recognized routines. but let's stick with the human angle which d.p. seems to prefer... these 15 "sneaky" errors withstood 5 rounds of human eyeballs...
if you're a believer that some errors are _so_difficult_ to catch that only the most expert of proofers can catch em, _these_are_them..._ i've listed the "sneaky" errors here, so you can meet the enemy: > http://z-m-l.com/go/chris/chris-p-flow-misses8.html > http://z-m-l.com/go/chris/chris-r-flow-misses13.html the problem is, even with x3 "marines", _half_ of these sneakies were missed. the p3 "marines" missed _7_ errors found by r3, and the r3 "marines" missed _8_ errors the p3 proofers caught... so, yes, only the marines were able to catch these sneaky errors -- dashing my hopes that x1 would give us an upset miracle -- but even the marines missed _half_ the sneaky errors they faced. that's not good news. sneaky errors have a 50/50 chance even against expert proofers. as in baseball, "good pitching beats good hitting every time, and vice versa..." expert proofers beat sneaky errors, _and_ vice versa. another problem is that, if you _look_ at the sneaky errors, it's hard to see how they differ from _hundreds_ of similar errors in this o.c.r. so even if we do not question that they earned their title as "sneaky", it's hard to say just exactly what it is that _makes_ them so "sneaky". they look like errors that would be easy to find! they look like errors that _were_ found! they look _exactly_like_ errors that were found! whatever the case, we see sneaky errors can withstand _5_rounds_ of proofing, even from the _best_ proofers we can throw at them, and in numbers that make the perfectionists among us unhappy... and as i noted, one sneaky error in the "planet strappers" test lived through _10_pairs_of_eyeballs_ before it met its demise. so "more human proofing" doesn't appear to be the answer... *** so, what to do? well, i've left you enough breadcrumbs in this message to give you a very good idea on where to go next... who can tell me where it is? -bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. 
(http://money.aol.com/tax?NCID=aolcmp00300000002850) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080414/db1353e3/attachment-0001.htm From Bowerbird at aol.com Mon Apr 14 11:41:43 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 14 Apr 2008 14:41:43 EDT Subject: [gutvol-d] whitewasher stops perpetual proofing machine Message-ID: wow. so i see that -- after manybooks scarfed it, and boingboing promoted it -- y'all have decided to post "planet strappers"... in view of all the work done on it, it would've been nice if you could've posted an error-free version of it. oh well... anyway, here are the 11 errors that i find wrong... > was moving only about five hundred feet from it's companion, GO-11. But, > was moving only about five hundred feet from its companion, GO-11. But, > "Nuts--let's give this sick rat to the Space Force right now." Art Kuzak > "Nuts--let's give this sick rat to the Space Force right now," Art Kuzak > sent the party? I saw where there rocket ship must have stood--a glassy, > sent the party? I saw where their rocket ship must have stood--a glassy > must be a new, popular song. He had heard so few new songs. > must be a new popular song. He had heard so few new songs. > threat of slow dying, an ordeal, as the sagging dome was torn from above > threat of slow dying, in ordeal, as the sagging dome was torn from above > lunar wilderness... What a switch--didn't think you'd goof! The > lunar wilderness... What a sitch--didn't think you'd goof! The > highway. There were other rough stretches, but most of the well selected > highway. There were other rough stretches, but most of the well-selected > idea of where the Kuzaks' supply post was, and the dizzying distance to > idea of where the Kuzak's supply post was, and the dizzying distance to > grubbed it, yourself? Sell it. Get the stink blown off you--forget some > grubbed it yourself? Sell it. 
Get the stink blown off you--forget some and these lines need to be tightened: > "Serene... > Found a queen... > And her name is Eileen..." (by the way, these lines also answered today's earlier question about whether there were more unbalanced doublequotemarks in the file... as separate paragraphs, as the current file indicates they would be, their quotemarks would be unbalanced. however, we can tell they're intended as one coherent block, and so should be structured as such.) note that two corrections were made in the "sent the party" line... and i would've left "in ordeal" as is, rather than changing it, as you did. and "sitch" -- which you changed to "switch" -- is shorthand for "situation". finally, the comma after "grubbed it" was in the original, but i think it's an error. it's great to see that a whitewasher can stop a perpetual proofing machine... but i hope it wasn't al who posted that; 10 errors won't make him happy. ;+) -bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolcmp00300000002850) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080414/fa30cc8c/attachment.htm From gbnewby at pglaf.org Mon Apr 14 13:32:07 2008 From: gbnewby at pglaf.org (Greg Newby) Date: Mon, 14 Apr 2008 13:32:07 -0700 Subject: [gutvol-d] whitewasher stops perpetual proofing machine In-Reply-To: References: Message-ID: <20080414203207.GA5251@mail.pglaf.org> Thanks for this errata report. I'm forwarding it to the folks who can take a look & fix it. -- Greg On Mon, Apr 14, 2008 at 02:41:43PM -0400, Bowerbird at aol.com wrote: > wow. so i see that -- after manybooks scarfed it, and boingboing promoted > it -- > y'all have decided to post "planet strappers"... in view of all the work > done on it, > it would've been nice if you could've posted an error-free version of it. > oh well... 
> > anyway, here are the 11 errors that i find wrong... > > > was moving only about five hundred feet from it's companion, GO-11. But, > > was moving only about five hundred feet from its companion, GO-11. But, > > > "Nuts--let's give this sick rat to the Space Force right now." Art Kuzak > > "Nuts--let's give this sick rat to the Space Force right now," Art Kuzak > > > sent the party? I saw where there rocket ship must have stood--a glassy, > > sent the party? I saw where their rocket ship must have stood--a glassy > > > must be a new, popular song. He had heard so few new songs. > > must be a new popular song. He had heard so few new songs. > > > threat of slow dying, an ordeal, as the sagging dome was torn from above > > threat of slow dying, in ordeal, as the sagging dome was torn from above > > > lunar wilderness... What a switch--didn't think you'd goof! The > > lunar wilderness... What a sitch--didn't think you'd goof! The > > > highway. There were other rough stretches, but most of the well selected > > highway. There were other rough stretches, but most of the well-selected > > > idea of where the Kuzaks' supply post was, and the dizzying distance to > > idea of where the Kuzak's supply post was, and the dizzying distance to > > > grubbed it, yourself? Sell it. Get the stink blown off you--forget some > > grubbed it yourself? Sell it. Get the stink blown off you--forget some > > and these lines need to be tightened: > > "Serene... > > Found a queen... > > And her name is Eileen..." > (by the way, these lines also answered today's earlier question about > whether there were more unbalanced doublequotemarks in the file... > as separate paragraphs, as the current file indicates they would be, > their quotemarks would be unbalanced. however, we can tell they're > intended as one coherent block, and so should be structured as such.) > > note that two corrections were made in the "sent the party" line... 
> and i would've left "in ordeal" as is, rather than changing it, as you did. > and "sitch" -- which you changed to "switch" -- is shorthand for "situation". > finally, the comma after "grubbed it" was in the original, but i think it's > an error. > > it's great to see that a whitewasher can stop a perpetual proofing machine... > > but i hope it wasn't al who posted that; 10 errors won't make him happy. > ;+) > > -bowerbird > > > > ************** > It's Tax Time! Get tips, forms and advice on AOL Money & > Finance. > (http://money.aol.com/tax?NCID=aolcmp00300000002850) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From Bowerbird at aol.com Mon Apr 14 16:35:16 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 14 Apr 2008 19:35:16 EDT Subject: [gutvol-d] whitewasher stops perpetual proofing machine Message-ID: greg said: > Thanks for this errata report.? I'm forwarding it > to the folks who can take a look & fix it. if i woulda known someone would act, i woulda given these links to the scans: > was moving only about five hundred feet from it's companion, GO-11. But, > was moving only about five hundred feet from its companion, GO-11. But, > http://z-m-l.com/go/plans/plansp044.html > "Nuts--let's give this sick rat to the Space Force right now." Art Kuzak > "Nuts--let's give this sick rat to the Space Force right now," Art Kuzak > http://z-m-l.com/go/plans/plansp051.html > sent the party? I saw where there rocket ship must have stood--a glassy, > sent the party? I saw where their rocket ship must have stood--a glassy > http://z-m-l.com/go/plans/plansp062.html > must be a new, popular song. He had heard so few new songs. > must be a new popular song. He had heard so few new songs. 
> http://z-m-l.com/go/plans/plansp068.html > threat of slow dying, an ordeal, as the sagging dome was torn from above > threat of slow dying, in ordeal, as the sagging dome was torn from above > http://z-m-l.com/go/plans/plansp070.html > lunar wilderness... What a switch--didn't think you'd goof! The > lunar wilderness... What a sitch--didn't think you'd goof! The > http://z-m-l.com/go/plans/plansp073.html > highway. There were other rough stretches, but most of the well selected > highway. There were other rough stretches, but most of the well-selected > http://z-m-l.com/go/plans/plansp075.html i do believe i've switched my mind on this one. i think _kuzaks'_ is correct. > idea of where the Kuzaks' supply post was, and the dizzying distance to > idea of where the Kuzak's supply post was, and the dizzying distance to > http://z-m-l.com/go/plans/plansp089.html > grubbed it, yourself? Sell it. Get the stink blown off you--forget some > grubbed it yourself? Sell it. Get the stink blown off you--forget some > http://z-m-l.com/go/plans/plansp110.html > and these lines need to be tightened: > "Serene... > Found a queen... > And her name is Eileen..." > http://z-m-l.com/go/plans/plansp068.html -bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolcmp00300000002850) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080414/f05f8686/attachment.htm From ajhaines at shaw.ca Mon Apr 14 20:36:16 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Mon, 14 Apr 2008 20:36:16 -0700 Subject: [gutvol-d] whitewasher stops perpetual proofing machine References: Message-ID: <002f01c89ea9$dc9a23d0$6501a8c0@ahainesp2400> Finally getting around to catching up on my e-mail, and I see my name has been dropped again. In fact, I *did* handle this submission. 
It should be pointed out, quite emphatically, that none of the "errors" mentioned below would have been found in that submission by anything less than a full, expert, smooth-read, or by a line-by-line comparison with another version of the book. Neither of these options is practical for the bulk of submissions, and certainly not for me, as Whitewasher. Anyone who thinks they are is sadly misinformed, and anyone who insists that either be part of the Whitewashing process can have my resignation as Whitewasher. I will not speak for the other Whitewashers. Greg may do so if he wishes. I'll address the "errors" in bowerbird's other message. Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Monday, April 14, 2008 11:41 AM Subject: [gutvol-d] whitewasher stops perpetual proofing machine wow. so i see that -- after manybooks scarfed it, and boingboing promoted it -- y'all have decided to post "planet strappers"... in view of all the work done on it, it would've been nice if you could've posted an error-free version of it. oh well... anyway, here are the 11 errors that i find wrong... > was moving only about five hundred feet from it's companion, GO-11. But, > was moving only about five hundred feet from its companion, GO-11. But, > "Nuts--let's give this sick rat to the Space Force right now." Art Kuzak > "Nuts--let's give this sick rat to the Space Force right now," Art Kuzak > sent the party? I saw where there rocket ship must have stood--a glassy, > sent the party? I saw where their rocket ship must have stood--a glassy > must be a new, popular song. He had heard so few new songs. > must be a new popular song. He had heard so few new songs. > threat of slow dying, an ordeal, as the sagging dome was torn from above > threat of slow dying, in ordeal, as the sagging dome was torn from above > lunar wilderness... What a switch--didn't think you'd goof! The > lunar wilderness... 
What a sitch--didn't think you'd goof! The > highway. There were other rough stretches, but most of the well selected > highway. There were other rough stretches, but most of the well-selected > idea of where the Kuzaks' supply post was, and the dizzying distance to > idea of where the Kuzak's supply post was, and the dizzying distance to > grubbed it, yourself? Sell it. Get the stink blown off you--forget some > grubbed it yourself? Sell it. Get the stink blown off you--forget some and these lines need to be tightened: > "Serene... > Found a queen... > And her name is Eileen..." (by the way, these lines also answered today's earlier question about whether there were more unbalanced doublequotemarks in the file... as separate paragraphs, as the current file indicates they would be, their quotemarks would be unbalanced. however, we can tell they're intended as one coherent block, and so should be structured as such.) note that two corrections were made in the "sent the party" line... and i would've left "in ordeal" as is, rather than changing it, as you did. and "sitch" -- which you changed to "switch" -- is shorthand for "situation". finally, the comma after "grubbed it" was in the original, but i think it's an error. it's great to see that a whitewasher can stop a perpetual proofing machine... but i hope it wasn't al who posted that; 10 errors won't make him happy. ;+) -bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolcmp00300000002850) ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080414/909b9585/attachment.htm From ajhaines at shaw.ca Mon Apr 14 20:50:48 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Mon, 14 Apr 2008 20:50:48 -0700 Subject: [gutvol-d] whitewasher stops perpetual proofing machine References: Message-ID: <000601c89eab$e42b0540$6501a8c0@ahainesp2400> I've looked at all the page scans mentioned below. I assume they came from DP, but that's only an assumption. I'll verify that when Joshua Hutchinson asks the Whitewashers to check his next batch of page scans from DP. All the second lines in the pairs below appear to have come from another version of this etext, unless the version I posted yesterday (Sunday) has been harvested from PG, reproofed, and put online elsewhere in the last 24 hours. I *was* going to address each of the items below, but I've got far too much else to do, so I've decided that there's no point going further when I've got only a handful of pagescans of unconfirmed origin, and don't have whatever other version of the book is out there and *its* pagescans. Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Monday, April 14, 2008 4:35 PM Subject: Re: [gutvol-d] whitewasher stops perpetual proofing machine greg said: > Thanks for this errata report. I'm forwarding it > to the folks who can take a look & fix it. if i woulda known someone would act, i woulda given these links to the scans: > was moving only about five hundred feet from it's companion, GO-11. But, > was moving only about five hundred feet from its companion, GO-11. But, > http://z-m-l.com/go/plans/plansp044.html > "Nuts--let's give this sick rat to the Space Force right now." Art Kuzak > "Nuts--let's give this sick rat to the Space Force right now," Art Kuzak > http://z-m-l.com/go/plans/plansp051.html > sent the party? I saw where there rocket ship must have stood--a glassy, > sent the party? 
I saw where their rocket ship must have stood--a glassy > http://z-m-l.com/go/plans/plansp062.html > must be a new, popular song. He had heard so few new songs. > must be a new popular song. He had heard so few new songs. > http://z-m-l.com/go/plans/plansp068.html > threat of slow dying, an ordeal, as the sagging dome was torn from above > threat of slow dying, in ordeal, as the sagging dome was torn from above > http://z-m-l.com/go/plans/plansp070.html > lunar wilderness... What a switch--didn't think you'd goof! The > lunar wilderness... What a sitch--didn't think you'd goof! The > http://z-m-l.com/go/plans/plansp073.html > highway. There were other rough stretches, but most of the well selected > highway. There were other rough stretches, but most of the well-selected > http://z-m-l.com/go/plans/plansp075.html i do believe i've switched my mind on this one. i think _kuzaks'_ is correct. > idea of where the Kuzaks' supply post was, and the dizzying distance to > idea of where the Kuzak's supply post was, and the dizzying distance to > http://z-m-l.com/go/plans/plansp089.html > grubbed it, yourself? Sell it. Get the stink blown off you--forget some > grubbed it yourself? Sell it. Get the stink blown off you--forget some > http://z-m-l.com/go/plans/plansp110.html > and these lines need to be tightened: > "Serene... > Found a queen... > And her name is Eileen..." > http://z-m-l.com/go/plans/plansp068.html -bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolcmp00300000002850) ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080414/f154d401/attachment-0001.htm From grythumn at gmail.com Mon Apr 14 21:11:06 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Tue, 15 Apr 2008 00:11:06 -0400 Subject: [gutvol-d] whitewasher stops perpetual proofing machine In-Reply-To: <000601c89eab$e42b0540$6501a8c0@ahainesp2400> References: <000601c89eab$e42b0540$6501a8c0@ahainesp2400> Message-ID: <15cfa2a50804142111g991877bi21d6de381e578c4a@mail.gmail.com> The actual project posted to 25067 is http://www.pgdp.net/c/project.php?id=projectID46a94b4a1838b I glanced through a few of these, and of those I checked the items marked below as "errors" are actually faithful reproductions of the original text. It looks like the reporter didn't bother checking the errata list against the original page images. R C On Mon, Apr 14, 2008 at 11:50 PM, Al Haines (shaw) wrote: > I've looked at all the page scans mentioned below. I assume they came > from DP, but that's only an assumption. I'll verify that when Joshua > Hutchinson asks the Whitewashers to check his next batch of page scans from > DP. > > All the second lines in the pairs below appear to have come from another > version of this etext, unless the version I posted yesterday (Sunday) has > been harvested from PG, reproofed, and put online elsewhere in the last 24 > hours. > > I *was* going to address each of the items below, but I've got far too > much else to do, so I've decided that there's no point going further when > I've got only a handful of pagescans of unconfirmed origin, and don't have > whatever other version of the book is out there and *its* pagescans. > > Al > > > ----- Original Message ----- > *From:* Bowerbird at aol.com > *To:* gutvol-d at lists.pglaf.org ; Bowerbird at aol.com > *Sent:* Monday, April 14, 2008 4:35 PM > *Subject:* Re: [gutvol-d] whitewasher stops perpetual proofing machine > > greg said: > > Thanks for this errata report. 
I'm forwarding it > > to the folks who can take a look & fix it. > > > if i woulda known someone would act, i woulda given these links to the > scans: > > > was moving only about five hundred feet from it's companion, GO-11. > But, > > was moving only about five hundred feet from its companion, GO-11. > But, > > http://z-m-l.com/go/plans/plansp044.html > > > "Nuts--let's give this sick rat to the Space Force right now." Art > Kuzak > > "Nuts--let's give this sick rat to the Space Force right now," Art > Kuzak > > http://z-m-l.com/go/plans/plansp051.html > > > sent the party? I saw where there rocket ship must have stood--a > glassy, > > sent the party? I saw where their rocket ship must have stood--a > glassy > > http://z-m-l.com/go/plans/plansp062.html > > > must be a new, popular song. He had heard so few new songs. > > must be a new popular song. He had heard so few new songs. > > http://z-m-l.com/go/plans/plansp068.html > > > threat of slow dying, an ordeal, as the sagging dome was torn from > above > > threat of slow dying, in ordeal, as the sagging dome was torn from > above > > http://z-m-l.com/go/plans/plansp070.html > > > lunar wilderness... What a switch--didn't think you'd goof! The > > lunar wilderness... What a sitch--didn't think you'd goof! The > > http://z-m-l.com/go/plans/plansp073.html > > > highway. There were other rough stretches, but most of the well > selected > > highway. There were other rough stretches, but most of the > well-selected > > http://z-m-l.com/go/plans/plansp075.html > > i do believe i've switched my mind on this one. i think _kuzaks'_ is > correct. > > idea of where the Kuzaks' supply post was, and the dizzying distance > to > > idea of where the Kuzak's supply post was, and the dizzying distance > to > > http://z-m-l.com/go/plans/plansp089.html > > > grubbed it, yourself? Sell it. Get the stink blown off you--forget > some > > grubbed it yourself? Sell it. 
Get the stink blown off you--forget some > > http://z-m-l.com/go/plans/plansp110.html > > > and these lines need to be tightened: > > "Serene... > > Found a queen... > > And her name is Eileen..." > > http://z-m-l.com/go/plans/plansp068.html > > -bowerbird > > > > ************** > It's Tax Time! Get tips, forms and advice on AOL Money & Finance. > (http://money.aol.com/tax?NCID=aolcmp00300000002850) > > ------------------------------ > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080415/935a9238/attachment.htm From Bowerbird at aol.com Mon Apr 14 23:19:37 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Apr 2008 02:19:37 EDT Subject: [gutvol-d] whitewasher stops perpetual proofing machine Message-ID: al said: > It should be pointed out, quite emphatically, > that none of the "errors" mentioned below > would have been found in that submission > by anything less than a full, expert, > smooth-read, or by a line-by-line comparison > with another version of the book.? of course, al. if you haven't been following my many posts on this book, which has been one of the tests being run over at d.p., you wouldn't know that all of these errors have been discussed in detail. that's the only reason i knew about them. nobody wants your resignation. it's just a little error-report. don't over-react to it... a winky-smiley means "all in good fun..." ;+) -bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. 
(http://money.aol.com/tax?NCID=aolcmp00300000002850) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080415/2e4fb1f9/attachment.htm From Bowerbird at aol.com Mon Apr 14 23:49:07 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Apr 2008 02:49:07 EDT Subject: [gutvol-d] whitewasher stops perpetual proofing machine Message-ID: robert said: > I glanced through a few of these, and of those > I checked the items marked below as "errors" > are actually faithful reproductions of the original text. > It looks like the reporter didn't bother checking > the errata list against the original page images. sometimes it's hard to keep a sense of humor here. even with a winky-smiley... ;+) robert, what kind of crap are you talking about here? do you not know that this is the book which has been the subject of the "perpetual p1 experiment" at d.p.? have you not read the _many_ messages i have posted detailing (in an excruciating way) the corrections that have been made to this book, in the normal workflow and its _seven_ (7, count 'em) iterations through p1? yes, indeed, some of the "corrections" that i listed there are indeed _corrections_ to _errors_ in the paper-book. evidently you didn't notice that the text as it exists now _also_ corrects many additional errors in the original... a "transcriber's note" details over two dozen corrections. if you want to discuss each and every one of mine, fine! (i'd really like to know how you _decide_ "which" errors you're going to correct, and which you will leave as-is. that will be a humorous discussion to have, you betcha.) and besides, there are some places where the text now _misrepresents_ what is on the scan. what about that? are we doing "actual faithful reproductions"? or not? 
either way, for you to suggest i "didn't bother" to "check" against the page-images is just plain _asinine_, since i have pored over that text and those scans for months. those scans are on my hard-drive, and on my site too... do you really think i gave al the links without checking them myself? what kind of a blooming idiot _are_ you? what _really_ happened is you're not reading my posts -- including this! -- so you end up saying stupid shit... :+) you say things indicating you think i have no knowledge of that text, when i know every wrinkle, inside and out... :+) you have no idea how ignorant you are, and yet you're proving it beyond a shadow of a doubt to everyone else. oh, hey wait, there _is_ something funny about that... :+) maybe it's easy to keep a sense of humor here after all... ;+) -bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolcmp00300000002850) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080415/5d0d2a32/attachment.htm From greg at durendal.org Tue Apr 15 04:20:20 2008 From: greg at durendal.org (Greg Weeks) Date: Tue, 15 Apr 2008 07:20:20 -0400 (EDT) Subject: [gutvol-d] whitewasher stops perpetual proofing machine In-Reply-To: <000601c89eab$e42b0540$6501a8c0@ahainesp2400> References: <000601c89eab$e42b0540$6501a8c0@ahainesp2400> Message-ID: On Mon, 14 Apr 2008, Al Haines (shaw) wrote: > I *was* going to address each of the items below, but I've got far too > much else to do, so I've decided that there's no point going further > when I've got only a handful of pagescans of unconfirmed origin, and > don't have whatever other version of the book is out there and *its* > pagescans. Don't worry about it. The perpetual P1 and the PG track were separate projects at DP. When the perpetual P1 project finishes, which it hasn't, the manager will submit a correct errata report.
By that time the correct page scans should be in PG to check against. -- Greg Weeks http://durendal.org:8080/greg/ From Bowerbird at aol.com Tue Apr 15 08:59:17 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Apr 2008 11:59:17 EDT Subject: [gutvol-d] whitewasher stops perpetual proofing machine Message-ID: greg said: > Don't worry about it. The perpetual P1 and > the PG track were separate projects at DP. > When the perpetual P1 project finishes, which it hasn't, > the manager will submit a correct errata report. when _will_ the perpetual project finish? and is there any reason to tolerate _known_ errors in the project gutenberg version at the present time? i really try hard to keep my obsessive perfectionism in check, knowing that the proofers are only human, and volunteers at that. errors happen. i accept that. but the lackadaisical "don't worry about it" attitude toward _known_ errors is rather difficult to accept. and i know the whitewashers are volunteers, too, and extremely busy ones at that... but in my humble opinion, error reports should get _highest_ priority, and that precedence should be made as clear and explicit as possible to the public, to start offsetting the public opinion that p.g. e-texts are error-ridden. (and if you don't understand that that's what the public opinion is, you ain't listening.) as it just so happened, when boingboing wrote up the presence of this book over at manybooks.net last friday, whatever version happened to be there at the time -- with whatever warts it had -- got _maximum_exposure_, likely more than this book will get in the next year of its existence... i think it woulda been nice if all those downloads would have been of a clean, error-free version... -bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolcmp00300000002850) -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080415/03fded6f/attachment-0001.htm From hart at pglaf.org Tue Apr 15 10:25:36 2008 From: hart at pglaf.org (Michael Hart) Date: Tue, 15 Apr 2008 10:25:36 -0700 (PDT) Subject: [gutvol-d] !@! Re: whitewasher stops perpetual proofing machine In-Reply-To: References: Message-ID: I have some yay and some nay about all this, as below. Michael On Tue, 15 Apr 2008, Bowerbird at aol.com wrote: > greg said: >> Don't worry about it. The perpetual P1 and >> the PG track were separate projects at DP. >> When the perpetual P1 project finishes, which it hasn't, >> the manager will submit a correct errata report. > > when _will_ the perpetual project finish? > Given the numbers of books and readers I hope we have: NEVER! > and is there any reason to tolerate _known_ errors > in the project gutenberg version at the present time? > You've heard me say many times that I do NOT believe in "canonical errors." Just fix them. Leave a note, AT THE END, if you want, but don't INTERRUPT the actual reading to show off that you know where an error is/was. . .that's just pomposity. > i really try hard to keep my obsessive perfectionism > in check, knowing that the proofers are only human, > and volunteers at that. errors happen. i accept that. > Having SAID that, ACT like it. Actions speak more loudly than words. > but the lackadaisical "don't worry about it" attitude > toward _known_ errors is rather difficult to accept. > If we presume a great many readers throughout the years, there is no reason NOT to presume errors will be ID'd, and then properly dealt with. > and i know the whitewashers are volunteers, too, > and extremely busy ones at that... > > but in my humble opinion, error reports should get > _highest_ priority, and that precedence should be _I_ think _highest priority_ should be NEW BOOKS.
I just refuse to make our top concern the 1% top elite who like to show off they can spot the errors [ though I appreciate their error reports more than anyone ] or the 1% at the bottom who can't tell what the books mean if the errors are included. I prefer to work with the vast majority in mind, not the exceptions at either end. > made as clear and explicit as possible to the public, > to start offsetting the public opinion that p.g. e-texts > are error-ridden. (and if you don't understand that > that's what the public opinion is, you ain't listening.) > "public opinion" sways in the breeze like fields of wheat. If you use public opinion as your compass, you will never have anything resembling a valid compass. you may get elected to public office, but you will be at the tender mercies of a VERY fickle public. > as it just so happened, when boingboing wrote up > the presence of this book over at manybooks.net > last friday, whatever version happened to be there > at the time -- with whatever warts it had -- got > _maximum_exposure_, likely more than this book > will probably get in the next year of its existence... > So be happy that your cup overfloweth. . . . > i think it woulda been nice if all those downloads > would have been of a clean, error-free version... > Rather than unhappy that you already know there is no "clean, error-free version." You can always fix more errors, but not always add more PD books. > -bowerbird > > > > ************** > It's Tax Time! Get tips, forms and advice on AOL Money & > Finance. > (http://money.aol.com/tax?NCID=aolcmp00300000002850) > From piggy at netronome.com Tue Apr 15 10:53:26 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Tue, 15 Apr 2008 13:53:26 -0400 Subject: [gutvol-d] !@! Re: whitewasher stops perpetual proofing machine In-Reply-To: References: Message-ID: <4804EB96.10200@netronome.com> Michael Hart wrote: > I have some yay and some nay about all this, as below.
> > Michael > > > On Tue, 15 Apr 2008, Bowerbird at aol.com wrote: > > >> greg said: >> >>> Don't worry about it. The perpetual P1 and >>> the PG track were separate projects at DP. >>> When the perpetual P1 project finishes, which it hasn't, >>> the manager will submit a correct errata report. >>> >> when _will_ the perpetual project finish? >> >> > > Given the numbers of books and readers I hope we have: > > NEVER! > I think the poster is referring to a specific experimental project at PGDP. I thought I had the termination criteria listed in the wiki, http://www.pgdp.net/wiki/Confidence_in_Page_analysis#Perpetual_P1 , but I see that I don't. They're somewhere in the CiP thread. I'll add them during my next big round of wiki edits. I expect at least one more round of this experiment, possibly as many as five more rounds. That will be followed by at least one round each of P2 and P3. We have four more Perpetual P1 experiments planned. See http://www.pgdp.net/wiki/Confidence_in_Page_analysis#Non-statistical_Help_Items . From gbnewby at pglaf.org Tue Apr 15 12:30:46 2008 From: gbnewby at pglaf.org (Greg Newby) Date: Tue, 15 Apr 2008 12:30:46 -0700 Subject: [gutvol-d] handling errata In-Reply-To: References: Message-ID: <20080415193046.GA22999@mail.pglaf.org> (Changing the subject line, since this has nothing to do with DP's proofreading rounds) On Tue, Apr 15, 2008 at 11:59:17AM -0400, Bowerbird at aol.com wrote: > ... > but in my humble opinion, error reports should get > _highest_ priority, and that precedence should be > made as clear and explicit as possible to the public, > to start offsetting the public opinion that p.g. e-texts > are error-ridden. As Michael just wrote, I'm confirming, with detail: Fixing errors is not PG's highest priority. Sorry. (We will VERY STRONGLY support anyone who wants to create some sort of project, affiliate site, etc. to receive & process errata.
I can easily imagine that such an effort could feed back changed eBook files to the 'main' PG.) I've written about the challenges of handling errata before, and will briefly summarize here. It's a very tough task. We have about 400 errata reports pending (some are duplicates, follow-ups, etc.), and none are in danger of being lost or forgotten. We very often receive incomplete or incorrect error reports (such as to fix spelling or punctuation that is not actually incorrect). We also get some wacky stuff, like people who want to censor particular passages, or who are wondering why there are no images in a .txt file. It all goes into the hopper for research & response. Even for complete & correct error reports, the reports still need to be applied. Sometimes we need to access page scans or do further research, but of course many times the errors and their fixes are clear. As a general practice, whoever does errata fixes likes to run the eBook through our complete current automated checks, to identify & fix any other things that need attention. This can include updating the file location & naming scheme, for eBooks < #10000. It's *always* non-trivial, and can sometimes take hours. We often need to touch multiple files (different character sets for .txt, an HTML, and perhaps others). While the small number of whitewashers are the only people who can actually push a new version of an eBook to the main servers, we are always happy to accept completely updated files. Those can be much faster to post. Such files are seldom what we get. Even when we do, see the above paragraphs about confirming changes and applying other needed fixes. Finally, let me remind everyone that we *are* technical people (also literary, volunteer-oriented, etc.). It is not at all challenging to find errors in PG's eBooks. Fixing them is the challenge.
-- Greg From gbnewby at pglaf.org Tue Apr 15 13:16:31 2008 From: gbnewby at pglaf.org (Greg Newby) Date: Tue, 15 Apr 2008 13:16:31 -0700 Subject: [gutvol-d] whitewasher stops perpetual proofing machine In-Reply-To: References: Message-ID: <20080415201631.GB22999@mail.pglaf.org> On Mon, Apr 14, 2008 at 07:35:16PM -0400, Bowerbird at aol.com wrote: > greg said: > > Thanks for this errata report.? I'm forwarding it > > to the folks who can take a look & fix it. > > if i woulda known someone would act, i woulda given these links to the scans: Fascinating stuff, and I apologize especially to Al for not looking more closely at the errata report below before forwarding it. There is (at least) one lesson here, and at least one example. The lesson is that the decision on whether to fix errors in the printed text is up to the eBook producer. (A correlary is that later people, applying potential fixes, might make their own decisions.) The example, below, is a great demonstration of why nobody in the processing chain is enthusiastic about taking errata reports and applying them blindly. It's very, very typical for errata to be, at least partially, in the eye of the beholder. -- Greg > > > was moving only about five hundred feet from it's companion, GO-11. But, > > was moving only about five hundred feet from its companion, GO-11. But, > > http://z-m-l.com/go/plans/plansp044.html > > > "Nuts--let's give this sick rat to the Space Force right now." Art Kuzak > > "Nuts--let's give this sick rat to the Space Force right now," Art Kuzak > > http://z-m-l.com/go/plans/plansp051.html > > > sent the party? I saw where there rocket ship must have stood--a glassy, > > sent the party? I saw where their rocket ship must have stood--a glassy > > http://z-m-l.com/go/plans/plansp062.html > > > must be a new, popular song. He had heard so few new songs. > > must be a new popular song. He had heard so few new songs. 
> > http://z-m-l.com/go/plans/plansp068.html > > > threat of slow dying, an ordeal, as the sagging dome was torn from above > > threat of slow dying, in ordeal, as the sagging dome was torn from above > > http://z-m-l.com/go/plans/plansp070.html > > > lunar wilderness... What a switch--didn't think you'd goof! The > > lunar wilderness... What a sitch--didn't think you'd goof! The > > http://z-m-l.com/go/plans/plansp073.html > > > highway. There were other rough stretches, but most of the well selected > > highway. There were other rough stretches, but most of the well-selected > > http://z-m-l.com/go/plans/plansp075.html > > i do believe i've switched my mind on this one. i think _kuzaks'_ is > correct. > > idea of where the Kuzaks' supply post was, and the dizzying distance to > > idea of where the Kuzak's supply post was, and the dizzying distance to > > http://z-m-l.com/go/plans/plansp089.html > > > grubbed it, yourself? Sell it. Get the stink blown off you--forget some > > grubbed it yourself? Sell it. Get the stink blown off you--forget some > > http://z-m-l.com/go/plans/plansp110.html > > > and these lines need to be tightened: > > "Serene... > > Found a queen... > > And her name is Eileen..." > > http://z-m-l.com/go/plans/plansp068.html > > -bowerbird > > > > ************** > It's Tax Time! Get tips, forms and advice on AOL Money & > Finance. 
> (http://money.aol.com/tax?NCID=aolcmp00300000002850) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From jeroen.mailinglist at bohol.ph Tue Apr 15 14:31:15 2008 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Tue, 15 Apr 2008 23:31:15 +0200 Subject: [gutvol-d] handling errata In-Reply-To: <20080415193046.GA22999@mail.pglaf.org> References: <20080415193046.GA22999@mail.pglaf.org> Message-ID: <48051EA3.8010503@bohol.ph> Greg Newby wrote: > (Changing the subject line, since this has nothing to do with > DP's proofreading rounds) > In this light, it would be very nice to import Gutenberg's main repository into a Distributed Revision Control System. This would have several benefits:
- We could keep exact track of what is changing where, and who made the change.
- We can allow interested people to pull a copy of the entire repository, make their changes at will, and let them publish it as a changeset. We can then review if what they have done is what we like, and pull the changes back into the main repository.
- Synchronizing with the growing collection would be as easy as saying 'pull' once in a while. Much easier than rsync or any other tool I have seen.
I have now been working with my master files in Bazaar (bzr) for half a year, and am quite happy, although, for the entire size of PG, this tool may not yet be up-to-speed. I also have experience with Subversion, and various commercial VCSes. On the downside:
- Overhead of about 200% for repository files and history.
- No good solution for .zip and .mp3 files, or other derived materials. (But we should use generation on the server on demand with caching for these anyway.)
- Maybe confusing for non-technical people.
Jeroen.
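[Editor's sketch: the pull-edit-review flow Jeroen describes can be approximated even without a full DVCS, by capturing a contributor's edit as a unified diff ("changeset") that a maintainer reviews before merging. This is a minimal illustration with Python's standard difflib, not anything PG or DP actually runs; the filenames and sample lines are invented, except that the its/it's line echoes the errata pair discussed in this thread.]

```python
# Approximate Jeroen's "publish a changeset, review, pull it back" flow
# using difflib instead of a real DVCS such as Bazaar.
import difflib

# The text as currently posted in the main repository (invented sample).
posted = [
    "was moving only about five hundred feet from it's companion, GO-11.\n",
    "The party moved on.\n",
]

# A contributor's corrected copy, pulled, edited, and offered back.
corrected = [
    "was moving only about five hundred feet from its companion, GO-11.\n",
    "The party moved on.\n",
]

# The changeset a maintainer would review before merging.
changeset = list(difflib.unified_diff(
    posted, corrected,
    fromfile="etext/plans.txt", tofile="contrib/plans.txt",
))
print("".join(changeset))
```

In a real DVCS the same review step happens at merge time; the point of the sketch is only that a changeset makes the proposed edit, and nothing else, visible to the reviewer.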
From Bowerbird at aol.com Tue Apr 15 15:22:17 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Apr 2008 18:22:17 EDT Subject: [gutvol-d] whitewasher stops perpetual proofing machine Message-ID: greg said: > There is (at least) one lesson here, and at least one example. ok, i always love lessons and examples... > The lesson is that the decision on whether to fix errors > in the printed text is up to the eBook producer. the producer of this e-text decided to _fix_the_errors_. well over two dozen were fixed, as noted in the e-text. > (A correlary is that later people, > applying potential fixes, > might make their own decisions.) people might. and why shouldn't they? but that's not what is at issue in this case. more importantly, people might _find_mistakes_ in the p-book that the original e-text producer _failed_to_notice_, and provide the corrections, under the assumption that the original producer _surely_ meant to fix those as well, a safe bet... (why would he fix just two-thirds of the errors?) some of these might be "a matter of opinion". but it's equally true that some might _not_ be, that any sane person looking at the mistake will agree with everyone that it _is_ a mistake. for instance, i say you misspelled "corollary". do you disagree? is that a matter of opinion? and even _more_ importantly, people might find places in the text where the original producer made a mistake in transcribing what was there, in the p-book, and bring that to your attention. in other words, no error in the paper-book, but an error in the e-text. without any equivocation. > The example, below, is a great demonstration > of why nobody in the processing chain is > enthusiastic about taking errata reports and > applying them blindly. who has suggested you should "apply them blindly"? has anyone -- _anyone_ -- suggested that _ever_? such a ridiculous suggestion could be laughed at, and dismissed out of hand, wouldn't you agree?
so why do you feel the need to drag in a strawman? this was a _careful_ report, the product of _much_ work -- not by myself, but by proofers who have pored over that "perpetual" text _11_times_ now -- and their work was not brought into the book. so i wrote it up, all nice and easy for you, and even gave you some convenient links to the page-scans. > It's very, very typical for errata to be, > at least partially, in the eye of the beholder. you tell me how many of the 10 changes i suggested are "in the eye of the beholder", so we _both_ know... and tell me how many of the changes _you'd_ make... otherwise, be more careful where you cast aspersions. *** well, this could have been a quiet little error report... but there's been so much buck-passing on the thing that now i'm gonna make sure we go over every single item on that report and discuss whether it is an error, or whether it is simply "in the eye of the beholder"... let's start with the first one, ok? > was moving only about five hundred feet from it's companion, GO-11. But, > was moving only about five hundred feet from its companion, GO-11. But, > http://z-m-l.com/go/plans/plansp044.html is there anyone who disagrees that this was a real error? don't be shy, if you think the top line is right, speak up! -bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolcmp00300000002850) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080415/c16260a4/attachment.htm From Bowerbird at aol.com Tue Apr 15 15:27:46 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Apr 2008 18:27:46 EDT Subject: [gutvol-d] !@! Re: whitewasher stops perpetual proofing machine Message-ID: um, michael... read the thread first, before you post, ok? your message was full of misunderstandings... thanks...
-bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolcmp00300000002850) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080415/bd6660a5/attachment.htm From gbnewby at pglaf.org Tue Apr 15 17:13:15 2008 From: gbnewby at pglaf.org (Greg Newby) Date: Tue, 15 Apr 2008 17:13:15 -0700 Subject: [gutvol-d] whitewasher stops perpetual proofing machine In-Reply-To: References: Message-ID: <20080416001315.GB27574@mail.pglaf.org> On Tue, Apr 15, 2008 at 06:22:17PM -0400, Bowerbird at aol.com wrote: > > let's start with the first one, ok? > > was moving only about five hundred feet from it's companion, GO-11. But, > > was moving only about five hundred feet from its companion, GO-11. But, > > http://z-m-l.com/go/plans/plansp044.html > > is there anyone who disagrees that this was a real error? > don't be shy, if you think the top line is right, speak up! (I'm assuming the page scan at http://z-m-l.com/go/plans/plansp044.html is the same as what was used by the producers.) The top line is clearly correct, since it's obvious in the page scan that there is an apostropher in "it's" The bottom line is clearly correct, since the top line is not grammatical, but the bottom line is and fits more correctly in the context. Some producers like to adhere to the exact printed text, including preserving errors. Other producers like to fix various things. As has been made abundantly clear, it is PG's policy to allow either approach, or neither, or a blend. -- Greg From Bowerbird at aol.com Tue Apr 15 19:50:26 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Apr 2008 22:50:26 EDT Subject: [gutvol-d] whitewasher stops perpetual proofing machine Message-ID: greg said: > (I'm assuming the page scan at > http://z-m-l.com/go/plans/plansp044.html > is the same as what was used by the producers.)
it should be completely unnecessary to say this, but of _course_ it is. why would i use any other? > The top line is clearly correct, since it's obvious > in the page scan that there is an apostropher in "it's" we agree that there is an apostrophe in "it's" in the scan... just like you made an error in your spelling of "apostrophe", typesetters make errors. this is such a case. that's an error. it's an obvious error. it's ridiculous to say it's "clearly correct". face it, even in _grade-school_, it would be marked incorrect. > The bottom line is clearly correct, > since the top line is not grammatical, > but the bottom line is at least you haven't taken leave of your senses entirely... ;+) > but the bottom line is > and fits more correctly in the context. i'll ignore that you said "more correctly", which is misleading, intentionally or not, since the top line is completely incorrect, and the bottom line is completely correct, so it's not a matter of _degree_ in the slightest, as the word "more" would imply. the bottom line is correct. the top line is not. that's an error. > Some producers like to adhere to the exact printed text, including > preserving errors.? Other producers like to fix various things.? and, as i said, this producer chose to fix the errors. but he didn't fix all of them. besides, when a producer chooses _not_ to fix the errors, he should annotate them clearly, so the reader is assured the mistake existed in the original, and was not introduced. the traditional way of indicating this is to use (sic), as i am quite sure you know, greg. why are you trying to duck this? *** ok, let's go on to the next one: > "Nuts--let's give this sick rat to the Space Force right now." Art Kuzak > "Nuts--let's give this sick rat to the Space Force right now," Art Kuzak > http://z-m-l.com/go/plans/plansp051.html i'll tell you right now the scan is ambiguous. i think there's a comma there, but i wouldn't wanna argue with anyone that it _might_ be a period instead. 
the point is, the sentence construction calls for a _comma_, so i say that the current e-text -- which has a period instead -- is in error. who disagrees? -bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolcmp00300000002850) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080415/1ed6a44a/attachment-0001.htm From ajhaines at shaw.ca Tue Apr 15 22:59:11 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Tue, 15 Apr 2008 22:59:11 -0700 Subject: [gutvol-d] whitewasher stops perpetual proofing machine References: Message-ID: <001c01c89f86$fe6ba280$6501a8c0@ahainesp2400> An excellent example of How to Win Friends and Influence People. Carnegie would be proud... See PG FAQ V.128 re the "[sic]" convention. -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080415/0956b176/attachment.htm From julio.reis at tintazul.com.pt Wed Apr 16 01:01:55 2008 From: julio.reis at tintazul.com.pt (Júlio Reis) Date: Wed, 16 Apr 2008 09:01:55 +0100 Subject: [gutvol-d] handling errata In-Reply-To: <20080415193046.GA22999@mail.pglaf.org> References: <20080415193046.GA22999@mail.pglaf.org> Message-ID: <1208332915.6531.27.camel@abetarda> On Tue, 2008-04-15 at 12:30 -0700, Greg Newby wrote: > Fixing errors is not PG's highest priority. Sorry. Hear hear, and it's not simply a matter of choosing quantity over quality; quite the opposite, because quality in a library has very much to do with the sheer number of books it owns. So for Gutenberg to achieve quality, it must have quantity. The rate of errors in books is important of course; but at least to me, ten times less important than the quantity of books it carries. We the Gutenberg volunteers have quite a lot of ground to cover. Every book, newspaper or pamphlet is nice reading to someone; and of research value for others. And when instead of single items you start having *collections*, then the value of each increases tenfold. Of course we want to keep the errors to a minimum; but without losing sight of the fact that quantity is worth ten times more than typographic perfection. > (We will VERY STRONGLY support anyone who wants to create > some sort of project, affiliate site, etc. to receive & > process errata. I can easily imagine that such an effort > could feed back changed eBook files to the 'main' PG.) I will just point out that, for instance, the domain gutfix.org is free. And that I'd volunteer for that, too. Júlio.
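[Editor's sketch: an affiliate errata service of the kind proposed above (the hypothetical gutfix site) would spend most of its time comparing a reported line against the posted line. Python's standard difflib can surface exactly which words differ, which is the mechanical part of the item-by-item review happening in this thread. The function name is invented; the sample pair is the its/it's item quoted from the errata list earlier in the thread.]

```python
# Mechanically list the word-level differences between a posted line
# and a proposed correction, so a reviewer sees only what changed.
import difflib

posted   = "was moving only about five hundred feet from it's companion, GO-11. But,"
proposed = "was moving only about five hundred feet from its companion, GO-11. But,"

def word_changes(old, new):
    """Return (old_words, new_words) pairs for each differing region."""
    a, b = old.split(), new.split()
    sm = difflib.SequenceMatcher(None, a, b)
    return [(a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

for old_words, new_words in word_changes(posted, proposed):
    print(" ".join(old_words), "->", " ".join(new_words))
```

For this pair the only difference reported is the single disputed token, which is the whole point: a reviewer can confirm or reject the change without re-reading the line against the scan first.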
From Bowerbird at aol.com Wed Apr 16 01:11:33 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Apr 2008 04:11:33 EDT Subject: [gutvol-d] whitewasher stops perpetual proofing machine Message-ID: al said: > An excellent example of How to Win Friends and Influence People. > Carnegie would be proud... i hate that philosophy! it plays to the very _worst_ aspects of human beings, with the goal of manipulation. it is an extremely sick, twisted methodology. *** now let's look at the next suggestion i made: > I saw where there rocket ship must have stood--a glassy, spot where > I saw where their rocket ship must have stood--a glassy spot where > http://z-m-l.com/go/plans/plansp062.html ooh, this is a good one, because there are two errors there. so, good thing the proofers were using their eyeballs there... but we'll just tackle the first error right now... look at the scan if you want, but i can tell you straight out that the top line is accurate as to what was on the page in the book. so, is there anyone out there who says the top line is correct? you know, where it says "there rocket ship". anybody at all? speak up, speak up loudly, so we know that you are there... your mommy will give you a trophy for trying _so_ hard, and dale carnegie will keep on whispering your name in your ear. or do you think it should say "their rocket ship"? i know i do. and i'm betting if you are one of those fortunate people who can keep their "their" right here and their "there" over there, that you are a person who knows the bottom line is correct... just in case you're not keeping track, i'm batting 3 for 3 so far. -bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolcmp00300000002850) -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080416/695b53cd/attachment.htm From Bowerbird at aol.com Wed Apr 16 01:32:25 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Apr 2008 04:32:25 EDT Subject: [gutvol-d] handling errata Message-ID: julio said: > Hear hear, and it's not simply a matter of > choosing quantity over quality; quite the opposite, > because quality in a library has very much to do with > the sheer amount of books it owns. you lost this battle over at d.p. two years ago, when the shift was made from 2 rounds to 5... quantity has suffered -- badly -- ever since... by the way, i woulda voted for "quantity" myself. you get "quality" when you start taking error-reports from the millions of eyeballs that look at the e-texts... of course, you have to have an infrastructure in place that can _handle_ those error-reports, which is what i have been saying here for several _years_ now, while absolutely _nothing_ has been changed in that regard. > So for Gutenberg to achieve quality, it must have quantity. oh please. p.g.'s input these days is google's output, so p.g. will _always_ be behind in quantity. and by a _huge_ margin. and once google turns on _their_ error-correcting routines, you'll quickly be left far behind in the quality department too. > The rate of errors in books is important of course; > but at least to me, ten times less important than > the quantity of books it carries. and you would lose this battle again at d.p. if it were held today. go look at the "how much quality would you like?" poll over there. you'll find almost everyone in the survey picked 10 errors at most for a 200-page book as the _minimal_ level of quality they desire. half the people said 4.5 errors was the most they would tolerate... this was a 160-page book, and it had 10 errors in it, as posted... so, let's review... d.p. pats itself on the back constantly because it has such a high rate of quality. and d.p. 
people _want_ quality, so much so that they invented a tortured proofing hierarchy that they _thought_ (wrongly) would give it to 'em, and they've become accustomed to the thought that they do put out high-quality work, but if you examine it closely, you find that it isn't that good at all, and when you provide a detailed error-report to get errors fixed, you quickly get buried in a sea of "this is a low priority" messages... meanwhile, public opinion -- that stuff that blows in the breeze -- (for as long as i've known) holds that p.g. books are error-ridden. but your mommy will still give you a trophy, and dale still loves you. -bowerbird ************** It's Tax Time! Get tips, forms and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolcmp00300000002850) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080416/4916e236/attachment.htm From hart at pglaf.org Wed Apr 16 10:06:05 2008 From: hart at pglaf.org (Michael Hart) Date: Wed, 16 Apr 2008 10:06:05 -0700 (PDT) Subject: [gutvol-d] Drawing the Line Message-ID: In response to ALL the efforts over the years to draw lines to eliminate certain aspects of Project Gutenberg, even for the elimination of certain kinds of people, I think perhaps the best response is to quote the great Donald Knuth: "Premature optimization is the root of all evil." Donald Knuth For those of us who don't recognize that name, we probably, almost certainly, owe him more than we can imagine. In the history of eBooks, even after nearly 40 years, it is still The Beginning of eBooks, and not time for optimizing, still time for trying more and more alternatives. 
In the end, I think we will find that the "defining aspect" of eBooks, as it will be written up decades or centuries from now, is that all we did was provide the raw materials for whatever become the accepted standards for publishing eBooks, once processes have been added that redefine eBooks so thoroughly that any reader of that period could tell at a glance which were products of the primitive eBook Era, and which were products of the mature eBook Era. Personally, I don't think our perspectives will even create the mental space necessary for such a shift from the primitive into, hopefully, the world of maturity. Michael From Bowerbird at aol.com Wed Apr 16 12:21:50 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Apr 2008 15:21:50 EDT Subject: [gutvol-d] Drawing the Line Message-ID: that's an interesting observation, michael. i certainly couldn't hope to match knuth... but my girlfriend enjoys doing sudoku and i have noticed with that type of puzzle that sometimes getting just one number to fall into place means the rest of it solves itself. so maybe we're _not_ as far away from your "defining aspect" as we might think we are... -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080416/198729ca/attachment.htm From Bowerbird at aol.com Wed Apr 16 12:28:31 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Apr 2008 15:28:31 EDT Subject: [gutvol-d] the deadline has passed, time to pay your karma tax Message-ID: my deadline for distributed proofreaders to act has passed... karma generally acts _slowly_, so d.p. still has time to extract itself from the quicksand, though merely thrashing about will only make things worse. 
remember, the deadline has passed. -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080416/5a5fb386/attachment.htm From hart at pglaf.org Wed Apr 16 12:52:35 2008 From: hart at pglaf.org (Michael Hart) Date: Wed, 16 Apr 2008 12:52:35 -0700 (PDT) Subject: [gutvol-d] Drawing the Line In-Reply-To: References: Message-ID: On Wed, 16 Apr 2008, Bowerbird at aol.com wrote: > that's an interesting observation, michael. > > i certainly couldn't hope to match knuth... > > but my girlfriend enjoys doing sudoku and > i have noticed with that type of puzzle that > sometimes getting just one number to fall > into place means the rest of it solves itself. > > so maybe we're _not_ as far away from your > "defining aspect" as we might think we are... > > -bowerbird I agree in the sense that we may only be a step or two away from some "defining ebook standard" that will go down in history for a while. However I think it may take any number of YEARS to get from one step to the next. So even though the number of steps might be few it might well be that the number of years is so great that I might not even be here to see it. Obviously we took several of those giant steps, just getting up to our eBook #10,000, proving a feasibility for eBooks that pundits or prophets alike had been trying to denounce. As soon as we proved feasibility Google jumped, and at least pretended to invent the eBook, but in truth they may not have even taken one step, in that historic project, as they haven't had a real effect on how eBooks are even perceived or much less used. 
I think the next "real" step will come when the $100 terabyte drives become extremely popular-- AND--people by the thousands, then the millions start to download and OWN their own libraries. "Personal Computer" = "Personal Library" Before Johannes Gutenberg the average person owned 0 books. Before Project Gutenberg that average person owned 0 libraries. This is going to change the idea/l of library. Once that many people OWN their own libraries, one of the results will be reformatting "THEIR LIBRARIES" to match their own personal taste. Eventually there will be a certain number from which to choose, of the most popular reformats available, and that might be the "next step" a world of eBooks will see and might have little or nothing to do with what ALL the pioneers of eBooks ever had in mind. . .might. If we are anywhere near as successful in making a world as populated with eBooks as I hope, we should expect to have the entire process taken out of our hands, not by Google, they have not the "vision" to invest what it takes even with a hundred billion dollars to spare. No, I think it will continue to be grassroots, as opposed to "astroroots," as operations such as Google's are coming to be called. Any number of people will collect up millions, and millions of eBooks, and eventually numbers of them will work out their own personal ways, just for themselves, of searching, viewing and otherwise using THEIR "personal libraries" and the rest of the world will benefit. The more different ways there are to do this, the more, and more, and more people will find acceptable ways for themselves to own and use libraries. In 10 years there will be petabytes for these, and people will be able to keep copies of each book ever written in all history. Things will continue to change more and more. Michael S. 
Hart Founder Project Gutenberg Inventor of eBooks From Bowerbird at aol.com Wed Apr 16 13:35:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Apr 2008 16:35:42 EDT Subject: [gutvol-d] Drawing the Line Message-ID: michael said: > Any number of people will collect up millions, > and millions of eBooks, and eventually numbers > of them will work out their own personal ways, > just for themselves, of searching, viewing and > otherwise using THEIR "personal libraries" and > the rest of the world will benefit. you listed "searching and viewing", which i think are two of the biggies. then you add "otherwise using". what are those other uses? i think they boil down to two things, basically. one is _remixing_ -- taking bits and pieces of books and assembling them into new, different creations... the other is _synthesis_ -- linking parts of a book to parts of other books, for the purpose of creating a new perspective by which those pieces are viewed. the crux of all this is that we need _digital_text_ to accomplish most of these purposes, even crudely... you talk about "millions and millions of e-books", and "petabytes", but the fact of the matter is that until we find a way to move quickly to digital text, we simply do not now have _millions_ of e-books. and we won't be able to have them soon, either... and if our e-books are in .pdf format, which is "the roach motel" from which text cannot leave, then we won't have text that is truly digital either. now, i'm optimistic, because i can see the way that we _can_ move quickly to digital text. very quickly. but it's _not_ the d.p. way. the d.p. way is wasting lots of volunteer labor... and it's slow. so until we improve or replace d.p., we're stuck... we're stuck in quicksand... moving from scan-sets to digital text is _the_ puzzle-piece to which i was referring with my sudoku reference. once that obstacle is cleared, we'll find the rest of it rushing toward us very fast. 
but until then, we're knee-deep in the quicksand... -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080416/f1136770/attachment.htm From hart at pglaf.org Wed Apr 16 17:09:47 2008 From: hart at pglaf.org (Michael Hart) Date: Wed, 16 Apr 2008 17:09:47 -0700 (PDT) Subject: [gutvol-d] Drawing the Line In-Reply-To: References: Message-ID: On Wed, 16 Apr 2008, Bowerbird at aol.com wrote: > michael said: >> Any number of people will collect up millions, >> and millions of eBooks, and eventually numbers >> of them will work out their own personal ways, >> just for themselves, of searching, viewing and >> otherwise using THEIR "personal libraries" and >> the rest of the world will benefit. > > you listed "searching and viewing", which i think are > two of the biggies. then you add "otherwise using". > > what are those other uses? Looking up in context definitions, famous quotations in context, doing word and phrase counts, names, etc. Looking up geographical and chronological information. [snip] > the crux of all this is that we need _digital_text_ to > accomplish most of these purposes, even crudely... Absolutely, but no mention of any particular format needed. > you talk about "millions and millions of e-books", > and "petabytes", but the fact of the matter is that > until we find a way to move quickly to digital text, > we simply do not now have _millions_ of e-books. You pretend not to have noticed it's a decade from now for petabyte drives, etc. . .shame on you! However, anyone with a broadband connection can simply set up a webcrawler and start on their million eBooks. Well, last year some of my Geek Lunch crowd were not happy unless they downloaded at least 2G per day. 
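Michael's "2G per day" webcrawler figure is easy to sanity-check: at roughly 2.75G per day (the compressed rate he cites), a year of downloading is about a terabyte, i.e. about a million plain-text books of a million characters each. A quick back-of-the-envelope check, treating a million characters as roughly 1 MB:

```python
# sanity-check the "million eBooks in a year" figure:
# ~2.75 GB/day of downloads vs. ~1 MB per plain-text book.
gb_per_day = 2.75
days_per_year = 365
mb_per_book = 1.0  # a million characters of plain text

total_mb = gb_per_day * 1000 * days_per_year
books_per_year = total_mb / mb_per_book
print(int(books_per_year))  # 1003750 -- just over a million
```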
At 2.75G per day it takes only one year to reach a million eBooks, at a million characters per book. Using various compressed files plus online compression in transit can get you up to about 2.75G, thus reducing either the daily download need or the number of days. Of course, this is already out of date, as Geek Lunchers have already told me they now have an even faster service, but I haven't gotten any benchmarks on it yet. > and we won't be able to have them soon, either... You probably would have said the same thing when I first started Project Gutenberg about getting to 10,000 books. The longest journey starts with but a single step. Want a head start? There are a number of services, including PG, that will send you 10,000 - 20,000 eBooks on CD and DVD. > and if our e-books are in .pdf format, which is > "the roach motel" from which text cannot leave, > then we won't have text that is truly digital either. Once again you are living in the past. There are now ways to get eBooks out of .pdf, just hire some 15 year old hacker to do it for you. It's a pain, but it can be pretty automated. . . . However, one way or another, there are millions of such eBooks in a number of formats, free for the taking, and there is enough bandwidth to make it possible to get a million of these per year with old technologies, but a lot faster this year. > now, i'm optimistic, because i can see the way that > we _can_ move quickly to digital text. very quickly. [snip] Actually, a Fed-X plane full of DVDs is pretty fast. ;-) You wanna complain this is not fast enough? Just think about what it was like 10, 20, 30 years ago. Hee hee! Michael From Bowerbird at aol.com Wed Apr 16 22:11:01 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 17 Apr 2008 01:11:01 EDT Subject: [gutvol-d] Drawing the Line Message-ID: michael said: > Looking up in context definitions, famous quotations > in context, doing word and phrase counts, names, etc. 
> Looking up geographical and chronlogical information. great examples... :+) do they fall under the category i have labeled "search"? or are they something different? if so, what is its label? i would note that at least some of these concerns are at the _library_ level, not the individual _e-book_ level. that's the level i _like_ to think about. but to get to that, there needs to be some consistency at the _e-book_ level. because that library-level stuff requires the use of computers, which are handling the library as a large database of content... > Absolutely, but no mention of any particular format needed. as long as the format does not obscure the data, we're fine... i use zen markup to strip away the trivialities of presentation, so i've pretty much got unobscured content from the get-go... but somebody else with another format could just do some programming, and strip away until they had pure content too. format is the container. no use wasting time talking about it. > You pretend not to have noticed it's a decade from now > for petabyte drives, etc. . .shame on you! oh, we'll have drives capable of storing millions of e-books. and we'll even have millions of e-books -- in scan-set form. but there's no assurance we'll have the _digital_text_ of them. and, as you yourself have repeated so eloquently in the past: "a picture of a book is not a book." even more to the point, "a scan of a page is not the text". we don't have the digital text of millions and millions of books. and with d.p. digitizing 2,345 books per year, we _never_ will... (where "never" means not for 400 years, give or take a century.) > However, anyone with a broadband connection can simply > set up a webcrawler and start on their million eBooks. you can download a million _scan-sets_. (but why wear out your wire when you can swap peta-drives?) but you cannot download a million e-books (as digital text)... that's the difference. until we crack this nut, we haven't got a cyberlibrary. 
we've just got a lot of pictures of some ink on paper... > There are a number of services, including PG, that will > send you 10,000 - 20,000 eBooks on CD and DVD. right. and between distributed proofreaders and nicholas hodson and al haines and david widger and david moynihan and richard seltzer and whoever else pops up, that number will increase by about 5,000 e-books every year. which means we'll get to a million e-books _in_digital-text_form_ in about 200 years. are you ready to wait that long? > Once again you are living in the past. well, i'm still running "tiger" as my mac o.s., not having upgraded to "leopard" yet. but i'm holding back in solidarity with my "xp" brothers on the p.c. side who hate "vista"... so i guess, in some ways i'm living in the past... ;+) no, i'm safely mired in the immediate present, maybe projecting my way, oh, say 5 years ahead... i'd like to see 10 million books in digital-text form in the next 5 years, and i see a pretty good path to getting there. but it's a path that nobody is taking. well, google probably is. but they're wearing harry's cloak of invisibility, so we only see the leaves rustle... > There are now ways to get eBooks out of .pdf, > just hire some 15 year old hacker to do it for you. > It's a pain, but it can be pretty automated. . . . i think i know a lot more about this than you do, michael. in fact, i'm pretty sure of it. yep, i'm real sure i know more. what you get out is the text, with a whole lot of junk in it, and a not-insignificant amount of damage to the content... up above when i said that "format doesn't matter", it was with the only-reasonable caveat that the format does no harm to the content. sadly, .pdf does not make that grade. (technically, it _can_, if you create the .pdf with that in mind; but i'm not aware of any e-book creation-tools that do that at this time, except for my own, so take my word on all this.) 
> However, one way or another, there are > millions of such eBooks in a number of formats, > free for the taking well, no, sorry, michael, but there isn't. there just isn't. not as digital text. it would be nice if there was, but there isn't. there are lots of variants of p.g. e-texts around, but that's not really "millions" of books, that's just overexuberant duplication of thousands of books... and i'm sorry, but unless internet archive is holding back on the posting of their books, they ain't close to 1 million, let alone plural. and much of their text is highly suspect... even if you count the books in the million-book project, most of which are not the kind you'd find in your library even if you live in one of the countries they specialize in (unless you happen to be an agronomist in such a nation), even when you count them, you still won't have "millions". and notice that i'm not even saying that you need "millions". personally, i think _one_million_ would be all that i'd need. but the fact remains that we're nowhere near _that_, either. and at the current rate of progress, we can't get there soon. > Actually, a Fed-X plane full of DVDs is pretty fast. ;-) yes, it is... :+) > You wanna complain none of this is not fast enough? > Just think about what it was like 10, 20 30 years ago. > Hee hee! yeah, i guess you're right. 20 years ago, it took about 3 years to do an average book. but these days, over at d.p., they can do a book in... um... well, actually, the average is it'll take them about 3 years... but you have to remember that they have a lot more people working on that book... oh wait. well, _anyway_... ;+) -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080417/7c15bbd7/attachment-0001.htm From walter.van.holst at xs4all.nl Wed Apr 16 23:46:23 2008 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Thu, 17 Apr 2008 08:46:23 +0200 Subject: [gutvol-d] Drawing the Line In-Reply-To: References: Message-ID: <4806F23F.8060408@xs4all.nl> Michael Hart wrote: > As soon as we proved feasibility Google jumped, > and at least pretended to invent the eBook, but > in truth they may not have even taken one step, > in that historic project, as they haven't had a > real effect on how eBooks are even perceived or > much less used. Actually, they have gone at least one step forward by adding services such as search and geodata to the books they've digitized. Regards, Walter From Bowerbird at aol.com Thu Apr 17 11:17:55 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 17 Apr 2008 14:17:55 EDT Subject: [gutvol-d] parallel -- chris and the clockmakers -- 12 Message-ID: more stuff from "christopher and the clockmakers", from the second parallel proofing experiment at d.p. *** ok, it's probably a good time to give a firm example of what i mean when i say an error can be "auto-detected". one error of this type is when the o.c.r. misrecognizes a period as a comma. the dead giveaway to this one is the following word -- as the start of a new sentence -- is capitalized, when (if it were just following a comma) it wouldn't normally be capitalized. (unless it's a name, or a title, or had another reason for being capitalized.) so we can easily have the computer check for this error. so let's do that on "christopher and the clockmakers"... below are the capitalized words preceded by a comma which are contained in the tool's spellcheck dictionary (which -- by design, for this check -- has no names). throw out all of these words that are a name or title... (hint, the publisher here is little, brown, and company.) 
> AND publisher name > And publisher name > BROWN publisher name > Chief title > Dad title > Doctor title > I personal pronoun > Inspector title > Jewellers business name > John name > Junior name > Lord title > Master title > Miss title > Mother title > Nevertheless oops! > Person oops! > SAPPHIRES chapter title > Sapphires chapter title > Senior title > Tony name that leaves two capitalized words followed by commas, where the capitalization is not due to a name or a title... and sure enough, upon examination of them, we find those two cases to be errors, in p1 or in r1 or in both... =p1=> one you usually receive from us, Nevertheless, =r1=> one you usually receive from us, Nevertheless, =p1=> too much of a luxury. People therefore consulted =r1=> too much of a luxury, People therefore consulted the important thing to note here is how the tool _focused_ our attention where it needed to be to find and fix errors... we didn't have to scour the pages. the tool found the bug. whereas a human proofer might have glossed right over it. (one of these errors was missed by two rounds of proofers, and the other was missed by one of two rounds of proofers.) heck, during iteration#7 of the "perpetual planet strappers", a p1 proofer has just found an incorrectly-capitalized word which was missed by the _8_ rounds of proofing before that. but a good tool will summon up such an error immediately, show you the scan, and let you make the correction by just clicking a button to make the necessary edits and re-save... -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... 
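The check bowerbird describes is simple to sketch: find capitalized words right after a comma, throw out names and titles, and flag what's left. The dictionary and whitelist below are tiny stand-ins (his tool's actual word lists aren't shown in the thread), so this is illustrative only:

```python
import re

# stand-in spellcheck dictionary (no names, by design) and a
# whitelist of names/titles to throw out -- both hypothetical.
COMMON_WORDS = {"nevertheless", "people", "person", "the", "however"}
NAMES_AND_TITLES = {"Miss", "Lord", "John", "Tony", "Doctor", "Brown", "I"}

def comma_capital_suspects(text):
    """Flag 'word, Capitalized' pairs where the comma may be a misread period."""
    suspects = []
    for before, word in re.findall(r"(\w+),\s+([A-Z][a-z]+)", text):
        if word in NAMES_AND_TITLES:
            continue  # legitimately capitalized
        if word.lower() in COMMON_WORDS:
            suspects.append((before, word))
    return suspects

print(comma_capital_suspects("one you usually receive from us, Nevertheless,"))
# [('us', 'Nevertheless')]
```

Run over the two flagged pages, a check like this surfaces exactly the "Nevertheless" and "People" lines quoted above, while "Little, Brown, and Company" passes clean.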
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080417/1da03cd4/attachment.htm From walter.van.holst at xs4all.nl Thu Apr 17 11:34:46 2008 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Thu, 17 Apr 2008 20:34:46 +0200 Subject: [gutvol-d] Drawing the Line In-Reply-To: References: <4806F23F.8060408@xs4all.nl> Message-ID: <48079846.4020906@xs4all.nl> Michael Hart wrote: >> Actually, they have gone at least one step forward by adding services >> such as search and geodata to the books they've digitized. > Was this just for me? My bad, am used to lists that have reply-to-list. > Personally, I think search was there before Google > and I haven't found any geodata that falls outside > the categories we had in school years ago. > > Am I missing something??? What Google Print does is a) search specifically geared towards books and b) for every book it provides a map with the places mentioned in the book on it. Neither is a quantum leap forward and the ultimate possibilities in terms of concordances, cross references etc. haven't been thought of yet, but it is a step further than just providing the data. Regards, Walter From Bowerbird at aol.com Fri Apr 18 10:29:22 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 18 Apr 2008 13:29:22 EDT Subject: [gutvol-d] actively hiding head in the sand -- the quicksand Message-ID: d.p. continues with its active effort to hide its head in the (quick)sand on all the data i've been presenting here about its _own_ experiments... they say these books are atypical. yeah, right. so why were they chosen for these experiments then? because if they really are atypical, _any_ results -- _including_ those obtained by d.p. -- won't qualify for generalization. no, these books were quite typical. they had the typical incompetence by content providers... and they presented with the expected pattern. 
you know, the pattern that seems to capture a "common-sense" take, which is that p1 fixes most of the errors, p2 gets most of the remaining ones, and p3 comes in and does clean-up. again, this is the pattern you get on page after page, in book after book, day after day, over in d.p.-land... nothing could be more typical... this "atypical" charge is just a dodge. look at any 10 d.p. projects -- chosen at random -- and you'll find the expected pattern described above on 9 of them. and the 10th will be the atypical one... -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080418/0c63c59e/attachment.htm From Bowerbird at aol.com Fri Apr 18 10:32:32 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 18 Apr 2008 13:32:32 EDT Subject: [gutvol-d] whitewasher stops perpetual proofing machine Message-ID: and here's the next post in our series looking at those ten corrections that i submitted for "planet strappers", to see which are "in the eye of the beholder". ok, now let's look at the second error in this line: > I saw where there rocket ship must have stood--a glassy, spot where > I saw where their rocket ship must have stood--a glassy spot where > http://z-m-l.com/go/plans/plansp062.html there _is_ a comma on the p-book page, and the top line accurately reflects that... there is no one disputing that reality. but since "glassy" is an adjective that describes the noun "spot", there is absolutely no call for a comma between the two. so it's clearly an error. 
so far, 4 clear errors out of 4 examined. *** this is fun. let's do another one... > must be a new, popular song. He had heard so few new songs. > must be a new popular song. He had heard so few new songs. > http://z-m-l.com/go/plans/plansp068.html on this page, there is a splotch, but clearly no comma after "new". (there's no ink below the baseline, let alone space for the comma.) the "rules" about the use of commas to separate multiple adjectives are not clear-cut enough to instruct us _unequivocally_ in this case. so if you really wanted to call the absence of that comma "an error" and you chose to introduce a comma to "fix the typesetter's error", i couldn't really _argue_ with that. but in my opinion, i don't think it's _necessary_, so i would've let this stand as it was in the p-book. i am reluctant to call something an error unless it's clearly an error, because that's a nicety i think we should extend to the old-timers... after all, they did all their work without our newfangled computers. still, i'm willing to grant that this one is "in the eye of the beholder". i'm not clearly wrong, and neither are you, so i'd call this one a draw. so, 4 clear errors out of 5 examined, with 1 "eye of the beholder". i'd say, at the halfway point, my error-report looks very accurate... -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... 
You simply trumpeted them all over this forum, after which *Greg* forwarded them to the errata list. Let's assume (*not* stipulate) that some, not all, of these errors are actually errors. So what? The handful of items that you found do not in any way spoil the book for anyone's casual reading, except maybe yours. (BTW - just in that original handful of page scans, *I* spotted an "item" *you* did not.) (You would have mentioned it if you had.) So much for *your* expertise... (Sorry, no hints.) It should also be pointed out that the *only* person who has *any* say on fixing these or any other items, is the original submitter. Until you can demonstrate that you can produce error-free submissions, you don't have a leg to stand on in criticizing others' submissions. To slightly re-cast a phrase, "Those that can, do; those that can't or won't, preach." Too bad the choir has left the building... Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Friday, April 18, 2008 10:32 AM Subject: Re: [gutvol-d] whitewasher stops perpetual proofing machine and here's the next post in our series looking at those ten corrections that i submitted for "planet strappers", to see which are "in the eye of the beholder". ok, now let's look at the second error in this line: > I saw where there rocket ship must have stood--a glassy, spot where > I saw where their rocket ship must have stood--a glassy spot where > http://z-m-l.com/go/plans/plansp062.html there _is_ a comma on the p-book page, and the top line accurately reflects that... there is no one disputing that reality. but since "glassy" is an adjective that describes the noun "spot", there is absolutely no call for a comma between the two. so it's clearly an error. 
i _did_ consider the possibility that this author was using "glassy" as a _noun_ that described these "spots" where a rocket ship had landed, but none of the other occurrences of that term supported that theory... so this is an obvious error. so far, 4 clear errors out of 4 examined. *** this is fun. let's do another one... > must be a new, popular song. He had heard so few new songs. > must be a new popular song. He had heard so few new songs. > http://z-m-l.com/go/plans/plansp068.html on this page, there is a splotch, but clearly no comma after "new". (there's no ink below the baseline, let alone space for the comma.) the "rules" about the use of commas to separate multiple adjectives are not clear-cut enough to instruct us _unequivocally_ in this case. so if you really wanted to call the absence of that comma "an error" and you chose to introduce a comma to "fix the typesetter's error", i couldn't really _argue_ with that. but in my opinion, i don't think it's _necessary_, so i would've let this stand as it was in the p-book. i am reluctant to call something an error unless it's clearly an error, because that's a nicety i think we should extend to the old-timers... after all, they did all their work without our newfangled computers. still, i'm willing to grant that this one is "in the eye of the beholder". i'm not clearly wrong, and neither are you, so i'd call this one a draw. so, 4 clear errors out of 5 examined, with 1 "eye of the beholder". i'd say, at the halfway point, my error-report looks very accurate... -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. 
(http://autos.aol.com/used?NCID=aolcmp00300000002851) ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080418/31c18271/attachment.htm From hart at pglaf.org Fri Apr 18 11:53:50 2008 From: hart at pglaf.org (Michael Hart) Date: Fri, 18 Apr 2008 11:53:50 -0700 (PDT) Subject: [gutvol-d] !@! Re: whitewasher stops perpetual proofing machine In-Reply-To: <001d01c8a17e$ea753440$6601a8c0@ahainesp2400> References: <001d01c8a17e$ea753440$6601a8c0@ahainesp2400> Message-ID: I try to let most of this pass, and usually successfully, and I know bb would prefer I kept silent as well, but one time out of so many I just have to say something, defend, as it were, the average reader, who may not recognize the depths to which the conversation has gone. My apologies to Al Haines, I don't really know him much-- even in the virtual sense; you haven't said anything worse than many others have, but some points need correction. On Fri, 18 Apr 2008, Al Haines (shaw) wrote: > It should be pointed out that you didn't "submit" them at all. > You simply trumpeted them all over this forum, after which *Greg* > forwarded them to the errata list. Let's not de-evolve into a semantic cesspool. Anyone who sends any of us a suggestion for correction has "submitted" it, and if WE do not pick up the ball, then WE are at fault, not them, not specifically bb. If someone points out a possible error and YOU do not see it through to correction, should it turn out to be one, YOU are the one at fault. . . . I know that even I, the founder of PG, have to remind a person I sent error reports to for MONTHS before the error is corrected, and I feel ashamed about it.
Greg has suggested that I go back to fixing errors myself as I enjoy it, but I get too many complaints that my vi editor doesn't leave the file exactly the way desired, or I would. . .it's so much simpler doing it myself. > Let's assume (*not* stipulate) that some, not all, of these errors > are actually errors. So what? The handful of items that you > found do not in any way spoil the book for anyone's casual > reading, except maybe yours. Here I could not agree with Al more completely, but I also believe in incremental, step-by-step error correction even if it reminds me of Zeno's Paradox. > (BTW - just in that original handful > of page scans, *I* spotted an "item" *you* did not. (You would > have mentioned it if you had.) So much for *your* expertise... > (Sorry, no hints.) This is plainly and simply unacceptable. If YOU have found a potential error, YOU are responsible for fixing it, even if it takes sending it on to someone you nag for months. This is hardly original throwing of clods from the sidelines; world-famous journal editors have said the same, without, senselessly, naming the errors. In one case, where their error list was reported to be in the hundreds, I offered, literally, a dollar per error, after looking myself. THAT shut up this world famous pain in the ass. . . . Never heard from him again. . . . > > It should be also pointed out that the *only* person who has *any* > say on fixing these or any other items, is the original submitter. FALSE. Anyone is allowed to create their own edition. After all, it's public domain information. > Until you can demonstrate that you can produce error-free > submissions, you don't have a leg to stand on in criticizing > others' submissions. This sentence was the final reason I had to respond.
This is the kind of illogical thinking that was used to try to stop me from starting PG at all! "If you can't do it perfectly, don't do it at all. . .!!!" Sorry Al, I really have to take issue with it. > To slightly re-cast a phrase, "Those that can, do; those that > can't or won't, preach." Too bad the choir has left the > building... I wonder if you realize the insult you are trying to add to some pretense of injury says more about you than about bb? Not to mention the rest of us on this list. Using such fallacious statements just might turn the tables on yourself, and the rest. Michael S. Hart Founder Project Gutenberg From Bowerbird at aol.com Fri Apr 18 11:58:30 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 18 Apr 2008 14:58:30 EDT Subject: [gutvol-d] whitewasher stops perpetual proofing machine Message-ID: al said: > It should be pointed out that you didn't "submit" them at all.
> You simply trumpeted them all over this forum, after which > *Greg* forwarded them to the errata list. you're right about that. i consider that sending error-reports to the p.g. listserve should count as giving p.g. sufficient notice... and -- if we _really_ want to fix the errors -- it should be. because if it has to be sent to one specific e-mail address for some bureaucratic reason, then _someone_ here will care enough to hit "forward" and make sure it's sent there. in this case, it was greg n. yay greg... > Let's assume (*not* stipulate) that some, not all, > of these errors are actually errors. al, you don't have to "assume" it, or "stipulate" it either... i'm stepping through them one by one, discussing them, showing the same logic publicly that i practiced in private while i was in the act of composing the list in the first place. i know you would have preferred if i had done that hastily, because then you could criticize me for that. but i didn't... > So what? The handful of items that you found to me, "handful" means 5 or less, like your thumb and fingers. it's very interesting to see how different people define words... > So what? The handful of items that you found do not > in any way spoil the book for anyone's casual reading i never said they did, al. in fact, i'll say that they do not... and i'm sure the errors you found in my one submission didn't "spoil" the book for anyone's casual reading either. but hey, we don't fix errors because they might "spoil" a book. we do it because (1) we can, and (2) it's the right thing to do... > (BTW - just in that original handful of page scans, > *I* spotted an "item" *you* did not. that's entirely possible, al... i'm nowhere close to infallible. > (You would have mentioned it if you had.) i wouldn't count on that if i were you... ;+) sometimes i keep errors to myself, save 'em for later use... > So much for *your* expertise... (Sorry, no hints.)
none of the items i reported reflect _my_ expertise. all of them were things that were found in the _test_ d.p. is running -- the "perpetual p1" test -- where they are putting an e-text through p1 repeatedly... i've been reporting the results from that test here... a dozen or more posts already, how'd you miss 'em? :+) _that's_ how i knew about the errors in the e-text. because proofers found them. (ok, i found _one_ error myself, if we want to get picky for the record.) i proofed some pages in the book, but not all of 'em. and i might or might not have run the book through my battery of tests, i'm not saying anything about that. yet... :+) i don't have to commit. but you did, by virtue of posting. at that point, your judgment was that the book was clean. at least clean enough to post. and i agree with you on that. it was clean enough to post. it had 10 errors, in 160 pages, or 1 every 16 pages, which i think is clean enough to post... and, by the way, i also noticed all the errors that were _fixed_ before the text was posted. i don't know if greg w. found them, or if you did, or a little bit of both, but whoever found 'em found a good number of 'em. i could give you a _list_ of them, if you wanted me to; see, i've been keeping pretty close tabs on this book. > It should be also pointed out that the *only* person > who has *any* say on fixing these or any other items, > is the original submitter. wow. that's the first i've heard of _that_ policy... does that mean if you can't find "the original submitter", then the text is frozen on any future error corrections? anyway, i'm sure greg w. will incorporate the corrections. after all, he made a commitment to do that, in this test... proofers are told directly their work will be incorporated.
unless they get the mandatory sugar-coated compliment before anything else, they'll turn hostile very quickly and interpret _everything_ that anyone says as "criticizing"... my error-report started out as just that -- an error-report. it wasn't a "criticism"... it was just a very simple heads-up... i lamented the fact that the corrections over which _many_ proofers have labored were not all included in the e-text before it got all the downloads from the boingboing post, but it certainly wasn't a "criticism". it was merely a lament. > To slightly re-cast a phrase, > "Those that can, do; those that can't or won't, preach." > Too bad the choir has left the building... like i said, carnegie followers get really hostile when they realize their little mind-games aren't working on someone. it shows how shallow their "be nice" _strategy_ really runs... -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080418/c8ea918a/attachment.htm From Bowerbird at aol.com Fri Apr 18 12:12:31 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 18 Apr 2008 15:12:31 EDT Subject: [gutvol-d] !@! Re: whitewasher stops perpetual proofing machine Message-ID: sorry, michael, i'd hit send before getting your post. but i still would have sent my response anyway... because al has to learn, just like everybody else, that if he tries to make the discussion about _me_ he's going to _lose_, and after that, i'll ultimately return the list to discussions that _are_ on-topic... and al, just like i'm immune to the sugar-coating, i am also immune to the hostility. you will learn... -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos.
(http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080418/370d95ee/attachment.htm From joshua at hutchinson.net Fri Apr 18 12:22:56 2008 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Fri, 18 Apr 2008 19:22:56 +0000 (GMT) Subject: [gutvol-d] !@! Re: whitewasher stops perpetual proofing machine Message-ID: <1327106929.675031208546576130.JavaMail.mail@webmail08> Since Michael and I do have a small history of public disagreement, I hope people won't assume I'm writing just to be contrary. Rather, I'm writing this to defend Al (not that I think he really needs it, but it's always nice to have someone step up beside you). I also apologize about the quote formatting... this mail client is horrible for responding to messages. On Apr 18, 2008, hart at pglaf.org wrote: On Fri, 18 Apr 2008, Al Haines (shaw) wrote: > It should be pointed out that you didn't "submit" them at all. > You simply trumpeted them all over this forum, after which *Greg* > forwarded them to the errata list. Let's not de-evolve into a semantic cesspool. **Josh** The difference is that bb did not put forward those errors with the intent to improve the final product. He put them forward with the intent to "rub noses" in the errors. bb, especially, knows exactly who to contact to get errors fixed. He's no newbie around here. Fixes were never his intention. Anyone who sends any of us a suggestion for correction has "submitted" it, and if WE do not pick up the ball, the WE are at fault, not them, not specifically bb. **Josh** This I agree with 100%. Michael has always been a do'er, not a just a talker. bb is a talker, never a do'er (the most you get out of him are hints that he's done something but he's not going to share). 
bb, at this point, is as morally responsible for getting things fixed as anyone else (and if sheer volume of posts is a good scale, MORE responsible). Greg has suggested that I go back to fixing errors myself as I enjoy it, but I get too many complaints that my vi editor doesn't leave the file exactly the way desired, or I would. . .it's so much simpler doing it myself. **Josh** I second Greg's motion. Get in there and make some fixes, Michael! You're good at them, well-qualified! If you're worried about vi messing up the line-feeds ... send the fixed file my way. Fixing that stuff is easy as hell and I'd be glad to give them a quick fix and post them. You can even consider me a sanity check to make sure you didn't zip the wrong file (yes, I speak from experience here!) > (BTW - just in that original handful > of page scans, *I* spotted an "item" *you* did not. (You would > have mentioned it if you had.) So much for *your* expertise... > (Sorry, no hints.) This is plainly and simply unacceptable. **Josh** Well, maybe "unacceptable" in the sense that it is a bit childish to say. But, given the terms bb has been labelling people with lately, rather mildly so. And really, rather understandable. > > It should be also pointed out that the *only* person who has *any* > say on fixing these or any other items, is the original submitter. FALSE. Anyone is allowed to create their own edition. After all, it's public domain information. **Josh** Yes, that's true. But PG tends to have certain quality controls in the form of the WW'ers (which means SOMEONE has final say on fixes). And in this case, isn't the original submitter going to be updating the file based on the results of an on-going quality experiment at DP? So, waiting on the original submitter, in THIS case, is correct. > Until you can demonstrate that you can produce error-free > submissions, you don't have a leg to stand on in criticizing > others' submissions. This sentence was the final reason I had to respond.
It is an example of the most elementary kind of fallacy. **Josh** Maybe the better wording would be that until bb does SOMETHING, ANYTHING regarding a book instead of just wasting people's time, he doesn't have a leg to stand on. I admit, I have a blind spot with regard to bb. If he said the sky is blue and the grass is green, I'd start looking for the trick, the lie, the insult. So maybe Al's comment directed to anyone else would strike me as out of line. But in this context, I'd say it was not nearly enough. Joshua Hutchinson From Bowerbird at aol.com Fri Apr 18 12:38:33 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 18 Apr 2008 15:38:33 EDT Subject: [gutvol-d] whitewasher stops perpetual proofing machine Message-ID: ok, let's review a few more of the items on that error-report that i "created"... here's the next one on my list: > threat of slow dying, an ordeal, as the sagging dome was torn from above > threat of slow dying, in ordeal, as the sagging dome was torn from above > http://z-m-l.com/go/plans/plansp070.html ok, in this case, the e-text producer made a change to what's on the page. there's no arguing about that. however, i'm not convinced that what was on the page was _an_error_... if you look up the definition for "ordeal", you'll find that the word refers to the "tests" that were given in previous centuries to determine whether people were "witches" or "possessed" in some manner. these tests included being boiled in a big pot, or being thrown into a river, or what have you. the state of being tested in such a way was known as being "in ordeal", much like today we use the phrase "in pain"... i believe the author was using that form of the phrase, in reference to the state that the character was in -- "in ordeal", and not pointing to the problem itself, which would be the case had he used "an ordeal". you might not be convinced by that argument, though... you might wanna say this is an "eye of the beholder" case.
but i think if you're gonna make an active _change_ to what is printed, you need to _make_sure_ there is _little_ uncertainty about the error... i think in this case, reticence is the right conclusion, which is why i advised a stet on this, reverting the change back to the original. so i'm gonna count that in my favor, making me 5 of 6 now... and here's the next item: > lunar wilderness... What a switch--didn't think you'd goof! The > lunar wilderness... What a sitch--didn't think you'd goof! The > http://z-m-l.com/go/plans/plansp073.html ok, the scan is very clear here that "sitch" is what was on the page. so this is another case where the e-text producer made a change. however, once you know that "sitch" is shorthand for "situation", you'll realize that that is exactly what the author intended, and so this is _not_ an error, and thus should _not_ have been changed... > http://www.urbandictionary.com/define.php?term=sitch don't feel bad. i didn't learn this slang term until fairly recently... today's findings are interesting -- 2 changes to what was actually on the page, changes which -- upon review -- should _not_ have been made... 6 of 7 correct so far... we'll finish this up next week... -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080418/cbf9b5fd/attachment.htm From hyphen at hyphenologist.co.uk Sun Apr 20 00:12:16 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Sun, 20 Apr 2008 08:12:16 +0100 Subject: [gutvol-d] !@!
Re: whitewasher stops perpetual proofing machine In-Reply-To: <1327106929.675031208546576130.JavaMail.mail@webmail08> References: <1327106929.675031208546576130.JavaMail.mail@webmail08> Message-ID: <000001c8a2b5$de233a10$9a69ae30$@co.uk> Joshua Hutchinson wrote: On Fri, 18 Apr 2008, Al Haines (shaw) wrote: >> It should be pointed out that you didn't "submit" them at all. >> You simply trumpeted them all over this forum, after which *Greg* >> forwarded them to the errata list. >Let's not de-evolve into a semantic cesspool. >**Josh** The difference is that bb did not put forward those errors with the intent to improve the final product. He put them forward with the intent to "rub noses" in the errors. bb, especially, knows exactly who to contact to get errors fixed. He's no newbie around here. Fixes were never his intention. Wrong! His intention is clearly to improve the **system**, not correct individual errors. A valid and even laudable aim. Dave Fawthrop From ralf at ark.in-berlin.de Sun Apr 20 02:30:41 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Sun, 20 Apr 2008 11:30:41 +0200 Subject: [gutvol-d] information needed Message-ID: <20080420093041.GA18521@ark.in-berlin.de> Hello, can you please drop me a short note if you're having connectivity problems with gutenberg.org and/or pgdp.net? Regards, ralf From prosfilaes at gmail.com Sun Apr 20 10:29:55 2008 From: prosfilaes at gmail.com (David Starner) Date: Sun, 20 Apr 2008 13:29:55 -0400 Subject: [gutvol-d] !@! Re: whitewasher stops perpetual proofing machine In-Reply-To: <000001c8a2b5$de233a10$9a69ae30$@co.uk> References: <1327106929.675031208546576130.JavaMail.mail@webmail08> <000001c8a2b5$de233a10$9a69ae30$@co.uk> Message-ID: <6d99d1fd0804201029x5bc59a05wedbbf299a4a901f9@mail.gmail.com> On Sun, Apr 20, 2008 at 3:12 AM, Dave Fawthrop wrote: > Wrong! > > His intention is clearly to improve the **system**, not correct individual > errors. > A valid and even laudable aim.
I think that's an incredibly generous interpretation, given his prior history. My interpretation is that he was digging up anything to discredit those he disagrees with. The lack of constructive comments that could improve the system seems to back up my interpretation. From hart at pglaf.org Sun Apr 20 12:12:39 2008 From: hart at pglaf.org (Michael Hart) Date: Sun, 20 Apr 2008 12:12:39 -0700 (PDT) Subject: [gutvol-d] !@! Re: whitewasher stops perpetual proofing machine In-Reply-To: <6d99d1fd0804201029x5bc59a05wedbbf299a4a901f9@mail.gmail.com> References: <1327106929.675031208546576130.JavaMail.mail@webmail08> <000001c8a2b5$de233a10$9a69ae30$@co.uk> <6d99d1fd0804201029x5bc59a05wedbbf299a4a901f9@mail.gmail.com> Message-ID: My interpretation is that you are all doing a great job of discrediting each other in this and similar conversations. It would be so much nicer if you were crediting something. On Sun, 20 Apr 2008, David Starner wrote: > On Sun, Apr 20, 2008 at 3:12 AM, Dave Fawthrop > wrote: >> Wrong! >> >> His intention is clearly to improve the **system**, not correct individual >> errors. >> A valid and even laudable aim. > > I think that's an incredibly generous interpretation, given his prior > history. My interpretation is that he was digging up anything to > discredit those he disagrees with. The lack of constructive comments > that could improve the system seems to back up my interpretation. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Mon Apr 21 08:17:35 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Apr 2008 11:17:35 EDT Subject: [gutvol-d] road-map for this week's posts Message-ID: here's a road-map for this week's posts... first, i will finish the thread looking at that error-report i've created.
we will see that 9 out of 10 of the items were clearly correct, and the 10th -- which is a judgment call -- should be resolved as i suggest... next, i'll make a suggestion for an experiment that d.p. _should_ do, with text from its "parallel" test of "christopher and the clockmakers". most of their data has merely confirmed what we've already known... but there _are_ some wrinkles which would give us good information. third, i'll continue with the analysis of the data from "christopher"... i was wrong initially, as this test has yielded some fascinating data. i'd estimate there are probably another half-dozen posts on that... fourth, i'll give you a new look at the data from "planet strappers", with a perspective that shows us how little is being "accomplished" in this test of "perpetual p1", _and_ how fragile human proofing is. the moral is "even a half-dozen rounds of proofing isn't enough." i'll also discuss more issues involved in _auto-detection_ of errors, in what will become a big series of messages on _pre-processing_ as it _should_ be performed at d.p., and in any digitization effort... i'm going to discuss some actual auto-detection routines, showing how simple they are to create, and how useful their output can be. get your reg-ex on. more fun than a metal-detector at the beach! i _might_ finally post my message on "sock puppets and sabotage -- the musical!", my reply to donovan's reckless claims of a while back... i've also got a long-written message that revisits the importance of _filenaming_ considerations, in which i relate some d.p. progress in this arena, thanks to excellent programming by dkretz over at d.p. i'll also discuss some other ideas being voiced at d.p., in areas like jeroen's text-heat-map, ellipses, and other topics that are raging... plus i've got an old message i never sent that has a new relevance, on transcription versus republication, so i will try to dig that out... 
also, there's my response to that person who came by wanting to program a viewer-app that would structure p.g. e-texts correctly. i was seeing if anyone had the balls to respond to that. um, nope. finally, i'm responsive to anything _you_ might wanna talk about. just make a post... and -- if you want me to hold my response -- remember you can always start the subject with "bastien rules"... so there's your road-map! gonna be a very busy week! and of course, there might pop up some idiots -- many of whom will not have read all these other messages that are chock-full of content -- who will try to tell you that i am "a mean troll" who is only here to pick fights, so i will have to slap those clowns back into the shadows... or hey, maybe i'll just let dave, my new hero, do it for me... :+) all in all, just another typical week here in the lobby of the p.g. library. -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080421/8cda6d8d/attachment.htm From Bowerbird at aol.com Mon Apr 21 11:11:31 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Apr 2008 14:11:31 EDT Subject: [gutvol-d] parallel -- the plunderers -- 01 -- an upbeat post Message-ID: i know i said i'd start with the conclusion of my error-report verification, but i think i'd prefer to start with something a little more upbeat than "i told you so". (although i love to say that, if you know what i mean...) fortunately, there is some up-beat news. we've got a new parallel-proofing experiment over at d.p.! yay! "oh no," i can hear you groaning. "how is that good news for _us_?", you're asking. "it's just more torture via data..." well, yes and no.
:+) as desperately as the "leadership" at distributed proofreaders would like for you to believe that the incompetently-prepared projects are "atypical" and "non-representative", they are altogether _too_ typical. however, well-prepared projects _also_ exist. my rough guess is that they constitute 1/3 of the projects... or half... maybe even two-thirds. that doesn't make the incompetently-done ones any less tragic, but -- for our benefit here -- it will be a pleasant change to analyze one that was prepared with _competence_, so we can skip that criticism... so -- ladies and gentlemen -- i give you "the plunderers"... the two project pages for the two parallel proofings are found here: > http://www.pgdp.net/c/project.php?id=projectID4807c5847b561 > http://www.pgdp.net/c/project.php?id=projectID4807d8eb2d3c2 as you can see, they are still being worked on, which is why i hadn't intended on talking about them quite yet, or even for a while, but it might be fun for me to get on out ahead of the proofers for once. this project comes from roger frank, who usually does good work. rfrank has programmed some excellent software for d.p. as well... first of all, most scans for this project were well-done. yay rfrank! second, rfrank used abbyy finereader to do the o.c.r. i can tell, because abbyy does a good job of recognizing "spacey quotes", the typographic convention common in older books where the double-quotemark was surrounded by spaces. unfortunately, that's not what we want in our digitizations today, because it looks funny to current readers, so we want to close those up... i'll come back to this in a minute. continuing with the good news, it appears that rfrank has done at least some preprocessing clean-up on the text. for instance, the end-of-line hyphenates have been automatically rejoined. and rfrank was smart enough to choose a book that has _zero_ ellipses in it, which means we won't have all _those_ problems... 
(well, at least i hope the book actually did have no ellipses in it, because if it did have any, rfrank lost them. but let's trust him.) and the good news keeps coming. i had decided that i would _not_ put up with the d.p. stupidity on filenames on this test... i've suffered through it with the last two experiments because i wanted to be able to point you to the actual scans over at d.p., but acting like i'm braindead makes me sick to my stomach, so i decided on this one i would do my typical renaming right off, and just mount the scans on my site and point you to _those_... so i downloaded the scans so as to rename them, and lo and behold, rfrank was also smart enough to choose a book where the first page of the book was labeled "page 9" simply because there had been 8 frontmatter pages. hallelujah, there is a god! so you can look at these scans on my site _or_ on the d.p. site. *** so right off, rfrank did _much_ of the content-processing well. this project will _not_ waste proofer's time and energy making thousands of unnecessary bureaucratic changes, which is good. not least of all because it'll dramatically lessen your data torture. not that the preparation was _perfect_, mind you. close. but not. i discuss this in some detail in another message which i've written and will be posting soon, so i won't go into the details on it now, but one of the first things i check on a book is the _paragraphing_. that is, i check whether all the paragraphs are delineated correctly. the easiest way to check this is to review paragraph _termination_. remember that a paragraph is indicated by a blank line, therefore any set of two consecutive linebreaks (minus certain exceptions) _not_ preceded by a "proper" paragraph termination is flagged... (a "proper" one is the period, exclamation-mark, question-mark, em-dash, semi-colon -- used for blocks -- either quoted or not.) *** the first such test isolates multiple consecutive bad paragraphs. 
these cases sometimes arise from the o.c.r. on block-quotations and other structures where the _leading_ diverged from default. the other frequent case of multiple consecutive bad paragraphs is a bad page-scan. in rfrank's file, this anomaly occurs on pages 182, 236, 241, 287. (actual text from these lines is appended.) i've uploaded the o.c.r. of this project so you can see those pages: > http://z-m-l.com/go/plund/plund-ocr-rfrank.txt search for 182, 236, 241, and 287 to see what triggered this test. the pattern -- excessive blank lines -- is familiar to everyone who has looked at a lot of o.c.r. results... if you want to look at the scans for the pages, you can go here: > http://z-m-l.com/go/plund/plundp182.html > http://z-m-l.com/go/plund/plundp236.html > http://z-m-l.com/go/plund/plundp241.html > http://z-m-l.com/go/plund/plundp287.html ok, page 182 looks like a good scan. the problems came in because it's a chapter-header, which often causes o.c.r. woes, especially around the drop-cap. (i don't know why drop-caps seem to be so hard for o.c.r. to grok. it's just a big letter, no?) likewise, pages 236, 241 (another chapter-head), and 287 all look ok. but i'd re-scan them anyway, in hopes that the o.c.r. was better on the re-scan... but otherwise i would do what i did in these cases -- correct the text against the scan. whenever a test takes you to an error, go ahead and correct it, even if it's not the _type_ of error that you were looking for then. otherwise, you just have to come back and relocate it later to fix it. there were a total of 9 singleton bad paragraphs -- a blank line interjected into the middle of a paragraph in the o.c.r. results... easy enough to fix when the cleaning tool takes you right to 'em. (these are also listed below.) there were 10 cases where a paragraph was terminated incorrectly, usually happening when the rightmost character is misrecognized. these cases are also appended, and again were quite easy to fix...
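for illustration, the termination check described above can be sketched in a few lines of python (my own sketch, not the actual tool; the terminator list follows the one given earlier -- period, exclamation-mark, question-mark, em-dash, semi-colon, optionally quoted):

```python
import re

# a paragraph may legitimately end in . ! ? ; or an em-dash (--),
# optionally followed by a closing quotemark; anything else gets flagged
TERMINATOR = re.compile(r'([.!?;]|--)["\']?$')

def flag_bad_paragraphs(ocr_text):
    """return (paragraph_index, last_line) for every paragraph whose
    final line does not end in a proper terminator; paragraphs are
    runs of lines separated by one or more blank lines"""
    flagged = []
    for n, para in enumerate(re.split(r"\n\s*\n", ocr_text.strip())):
        last_line = para.rstrip().split("\n")[-1].rstrip()
        if last_line and not TERMINATOR.search(last_line):
            flagged.append((n, last_line))
    return flagged
```

note that a blank line injected into the middle of a paragraph splits it in two, and the first half then ends mid-sentence, so this one check surfaces both the singleton cases and the multiple-consecutive cases described above.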
with any auto-detection routine, we will inspect the false-alarm rate, to minimize wasting our time considering flags that are _not_ wrong. in this case, so you know, there were just _4_ false-alarms, which is a number that's quite tolerable considering we fixed 2-3 dozen errors. all in all, that's the kind of tradeoff that's very good for the proofers... "well, sure," you might reply, "but it's not good for the preprocessor, because it just shifts the work to them." well, that's true, but it shifts much less work to them by making it easier for them to apply the fix. so it's cost-effective in an overall way. but even if you don't buy that, you don't have to. just include the routines in the "wordcheck" which the proofers use, so that the bugs are flagged, and can be fixed easily. in other words, put auto-detects wherever you want, but _put_them_in_. *** ok, now that we are sure we have the paragraphs correctly delineated, we can do the next fix in the battery, an auto-fix on spacey-quotes... that's right, you can write a computer routine to repair spacey-quotes. evidently, rfrank didn't know this, because he left lots of spacey-quotes in this text -- well over 318 of them, the number of pages in this book. compared to the _thousands_ of unnecessary bureaucratic changes that were required in the other books, these 318 don't seem all _that_ bad, but even _hundreds_ of unnecessary bureaucratic errors are too many. the spacey-quote repair routine goes like this: 1. split the text into paragraphs, and for each paragraph... 2. analyze the double-quotemarks as "open, closed, or spacey"... 3. check whether the open/closed ones fit the expected pattern... 4. and, if they did, then assign the spacey-quote the expected status... 5. or, if not, flag the paragraph for a human resolution of the problem... the expected pattern, of course, is "open/close//open/close//open/close", ad infinitum. so, for instance, like on page 15, if we get a pattern that goes: > 1. open > 2. 
close > 3. spacey > 4. close > > "If he doesn't he has a supreme nerve," the > younger man replied. " They look to me as if > they mean trouble. They're in a pretty nasty > temper--what with all the poison they've poured > in, and all the injustice they believe they have > met. Wonder who's right?" here, we can confidently switch the #3 spacey-quote to an open-quote... or consider this one, on page 18: > Without any appearance of haste, and as if > scornful of the mob that had so recently been > threatening to hang him, the man walked back > to his buckboard, climbed in, and stood there on > his feet with the reins in one hand, and the rope > in the other. " You get away from in front of > me there," he said, in his harsh, incisive voice; > "I'm tired of child's play. If you don't let me > alone, I'll kill a few of you. Now, clear out!" > > 1. spacey > 2. close > 3. open > 4. close this pattern confidently switches the #1 spacey-quote to an open-quote... there are more permutations to this routine, but we'll discuss those later. for the most part, in this book, the rule as stated here will work quite fine. *** one other big place where rfrank's preprocessing fell down was on lines starting with a dash or a hyphen. you definitely want to check these out... i'm not clear on the d.p. policy about an em-dash at the start of a line, so i brought them up, but i generally leave 'em there if they're printed there. *** the rest of the missing preprocessing was smaller stuff, routines such as garbage character removal, number-checking, inconsistent punctuation, inconsistent capitalization, stuff like that... *** before i did the last 2 big fixes -- spacey-quotes and start-line dashes -- i uploaded a version of my kinda-cleaned-up-a-little o.c.r. file. as z.m.l.: > http://z-m-l.com/go/plund/plund.zml you can search for spacey-quotes -- space-quotemark-space -- and see for yourself how my clever little paragraph algorithm does its magic. *** the z.m.l.
file allowed me to create a full-on .html version of the book... again, this is just the o.c.r., not even proofed, so it's got errors in it, but the paragraphs are formatted correctly, even around the page-breaks... so the demonstration here is how easy it is to put the book on the web... *** as i said, this experiment just got underway, so i won't discuss the data. i just wanted to point out that it _is_ possible to prepare a project right, and it's actually done over at distributed proofreaders, even regularly... it's important to keep the focus on the ones that are _not_ done correctly, because those badly-prepared projects cause tremendous waste at d.p., but it's also important to know that not _all_ projects are always that bad. so it'll be nice to analyze the data from a well-prepared project for once. so you know, it won't change the _pattern_ we've got so far, the pattern that seems to capture a "common-sense" take, which is that p1 fixes most of the errors, p2 gets most of the remaining ones, and p3 comes in and does clean-up. again, this is the pattern you get on page after page, in book after book, day after day, over in d.p.-land... so i would be amazed if that's not what we get in this book. (well, not totally amazed. i would have an explanation for it. but let us not cross that bridge unless we actually come to it.) of course, p1 won't be making _thousands_ of "corrections"... it'll be more like _hundreds_ -- with those spacey-quotes -- but then p2 will also have less to correct -- likely dozens -- and p3 will probably end up with a number in single-digits. and if rfrank tightens up his tools just a little bit, like mine, he'll find that p1 will do dozens of changes, p2 single-digits, and p3 will end up twiddling their thumbs looking for stuff... or -- since p2 is a bottleneck too -- just route text through p1 a second time, and a third, and the book will be nearly perfect... now where have i heard _that_ idea before? -bowerbird p.s. 
here are the cases in "the plunderer" with bad paragraphing... -> multiple bad paragraphs -- page 182, 236, 241, 287 CHAPTER XI ##### bells' valiant fight ##### "IT ~Y TE'LL time to ##### thank you now, Mrs. Meredith, but some day ##### "You "Well," she said, " it doesn't matter. I am ##### not jeal------I'm ""T "T TOW! Somethin' seems to have kind ##### \r of livened up from ##### his regular trip underground, he stamped into the ##### office, in semblance of that cross whose name ##### was the name of the mine------ -> excess blank line needing to be removed he named came in, and he'd head us into ##### it. to the top of this divide, and then we'll know for ##### sure." to stave off an empty belly. You can go ##### now." yet I saw him do somethin' once that beat ##### me." you. You can return it to me at your con- ##### venience." smith, frowning in his face.//"Right here in this cabin. Been here two days ##### now." seems less hard to me now that I know you ##### care." "But--but------" halted the engineer. " Bill ##### said there. They seem like such lonesome, forgotten ##### cusses. -> bad termination of previous paragraph I've got it! Arope! " ##### The partners were preparing to jump forward the green of the forests below, ##### "Say, I believe you're right, Dick!" he exclaimed. of the mountain coming up to drive us off 1" ##### "Hello," hailed a shrill, quavering are what men make them, no better, no worse.'* ##### "I have made no criticism," \ ##### She checked him. engineer and a helper. It has but one mill boss.'* ##### "Working eight batteries? ground. j; ##### "Well," demanded Rogers, " what have you things.'" ##### He held his knotted, rough fingers open before CHAPTER XVI//BENEFITS RETURNED//* ##### DICK waited impatiently at the rendezvous, and the beautiful. Ah, how I love it--all! All 1" ##### Dick's arm slipped round her, -> actually correct you'd not be helplessly busted.'" ##### He jumped to his feet with an exclamation. 
a sort of millinery store in------ ##### Here a name had been painstakingly obliterated, Your ever grateful, ##### pearl walker. the shift up, and 'tend to the firing myself.'" ##### For an instant Dick was enraged From Bowerbird at aol.com Mon Apr 21 16:40:56 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Apr 2008 19:40:56 EDT Subject: [gutvol-d] whitewasher stops perpetual proofing machine Message-ID: ok, let's look at the last 3 of the items on that error-report that i created... here's #8: > highway. There were other rough stretches, but most of the well selected > highway. There were other rough stretches, but most of the well-selected > http://z-m-l.com/go/plans/plansp075.html ok, this one is straightforward. if you look at the scan, there is a splotch between "well" and "selected". so you might think the dash between them was just a misrecognition... but if you blow up that area, you can plainly see that there was indeed a dash there on the page, as there is a clear right-angle present which would not be there if it was just a splotch, at upper-left of the splotch... so i've got 7 of 8 right so far, with 1 "eye of the beholder" case... here's #9: > grubbed it, yourself? Sell it. Get the stink blown off you--forget some > grubbed it yourself? Sell it. Get the stink blown off you--forget some > http://z-m-l.com/go/plans/plansp110.html ok, there is plainly a comma there on the page. but i think it's there in error. i see no reason for a comma there... until someone can explain such a reason, i'll consider myself right. 8 of 9 right, with 1 eye-of-the-beholder...
and #10: > and these lines need to be tightened: > "Serene... > Found a queen... > And her name is Eileen..." > http://z-m-l.com/go/plans/plansp068.html semantically, this is a block of 3 lines, not 3 separate paragraphs, so it's obvious that they should not be separated by empty lines... besides, if they _were_ 3 separate paragraphs, this would _still_ be in error, because then they'd have unbalanced quotemarks... 9 of 10 right, with 1 eye-of-the-beholder... *** oh, another error was found in the "perpetual" test, on page 129: > The little sun was half sunk behind the Horizon. > The little sun was half sunk behind the horizon. pretty obvious, that one. no need for a capital letter there... won't count it on my list, but included here for completeness. as for the previous new error that was found -- an unbalanced quotemark -- the posted version of this e-text had fixed that, probably due to some gutcheck goodness. way to go, gutcheck! likewise, the one "known error" in the iterations "perpetual" text -- an extra comma after "he came" on page 33 -- was also fixed in the posted version of this e-text, because it got fixed in p2... *** so let's summarize our findings on the 10 items on my report... 9 out of the 10 were double-checked as accurate assessments. so, we've got 1 eye-of-the-beholder case out of my 10 items. maybe 2, if you want to count that "in ordeal" as inconclusive. in absolutely no case did my reported item prove to be wrong. (i previously withdrew the one case that might have done so.) so i think i did a pretty good job on creating my error-report. _especially_ when you consider as well that in _both_ of those eye-of-the-beholder cases (assuming you count the second), _my_ recommendation was to _use_what_was_on_the_page_, while the original e-text producer had made _changes_ to it... surely if something is _vague_ and "in the eye of the beholder", the proper course of action is to _use_what_was_on_the_page_.
so i believe i followed the right course of action on all 10 items. but, then again, of course, i _would_ believe that now, wouldn't i? because the reason i followed each one of those courses of action was precisely because i had decided it was indeed the right one... the question is, who is going to counter-argue any of the cases? yes, step right up, folks. i want to hear your counter-arguments. *** in the absence of that, however, i think we can safely decide that robert's asinine statement that i hadn't even looked at the scans has been proven to _be_asinine_. and further, greg's statement that my error-report was an example of how error-reports can be subject to the "eye of the beholder" effect was a misrepresentation, undoubtedly due to the fact that he didn't actually look at the items. but let's let robert's own words speak for themselves: > I glanced through a few of these, and of those I checked > the items marked below as "errors" are actually faithful > reproductions of the original text. It looks like the reporter didn't > bother checking the errata list against the original page images. no, robert, _you_ did not "bother" engaging your brain to answer the simple and obvious question: "was the original text an error?" and let us also let greg n.'s words speak for themselves: > The example, below, is a great demonstration of why > nobody in the processing chain is enthusiastic about > taking errata reports and applying them blindly. It's very, very > typical for errata to be, at least partially, in the eye of the beholder. perhaps that might be true in general, but we have now demonstrated through a close look at each one of them that _this_ error-report was _not_ a demonstration at all -- let alone a "great" demonstration -- about how errata are "in the eye of the beholder". not in the slightest. these errors were completely clear to anyone with an eye and a brain...
*** so if there _is_ something troubling about this tempest in a teapot, it is precisely that many of the whitewashers seem to have decided error-reports are inherently untrustworthy and not worth the work, or that they will use the flimsiest of excuses to make that judgment. maybe that's why error-reports are acted upon so slowly. and maybe that's why the public feels no need to make such reports. instead, they just complain about how the e-texts are full of errors... -bowerbird From Bowerbird at aol.com Tue Apr 22 09:31:04 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Apr 2008 12:31:04 EDT Subject: [gutvol-d] a parallel test that would provide juicy info Message-ID: road-map said: > next, i'll make a suggestion for an experiment that d.p. _should_ do, > with text from its "parallel" test of "christopher and the clockmakers". > most of their data has merely confirmed what we've already known... > but there _are_ some wrinkles which would give us good information. ok, this one is extremely straightforward. "christopher and the clockmakers" had a "normal" p1->p2->p3 workflow. it then repeated a brand-new p1->p2->p3, which i called r1->r2->r3... this showed each of the workflows produced a good, but not perfect, text. one still had 8-10 errors in it after the normal 3 rounds of proofing, and the other one had 13 errors remaining. again, pretty good, but not great. given the difficulty of pushing a book through p3 once, let alone _twice_ -- and no perfect text coming out where the sausage is supposed to be -- it's worthwhile asking if 3 rounds (or so) of p1 could do the job as well...
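as a back-of-the-envelope illustration -- with _assumed_ catch rates, not d.p. measurements -- if every pass catches roughly the same fraction of the errors still present, the remaining count decays geometrically, so in principle a few extra p1 passes could buy the same cleanup as a p1->p2->p3 chain:

```python
def errors_remaining(initial_errors, catch_rate, rounds):
    """expected errors left after `rounds` passes, assuming each pass
    independently catches `catch_rate` of the errors still present
    (a toy model for the sake of argument, not fitted to d.p. data)"""
    remaining = float(initial_errors)
    for _ in range(rounds):
        remaining *= (1.0 - catch_rate)
    return remaining
```

e.g. 400 o.c.r. errors at an assumed 80% per-pass catch rate leave about 3 expected errors after three passes -- whatever labels those passes happen to wear.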
so the logical test would be to give the p1 output more spins through p1. also, since we have it, and it's perfectly good, do the same with r1 output. if 2-3 more iterations of p1 make the p1 output as good as p1->p2->p3 -- and do the same by perfecting the r1 output as well as r2 and r3 did -- then there's really no need to make books sit in the long p2 and p3 queues. a 2x2 test -- 2 normal flows crossed with 2 iterative flows -- would maximize the strength of the comparisons both within and across the different flows... so hey guys, please consider doing that... that's all... thank you very much... -bowerbird From Bowerbird at aol.com Tue Apr 22 10:25:01 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Apr 2008 13:25:01 EDT Subject: [gutvol-d] !@! Re: whitewasher stops perpetual proofing machine Message-ID: gosh, since michael quoted his message in full, which had been delivered to my spam folder, i can see that david starner said that there's a "lack of constructive comments that could improve the system" in my messages... that's the reason david now lives in my spam folder. is there even any reason to respond to such dreck? well _sure_, because what the heck? i can't let dave and michael have all the fun of shooting barrel-fish. heck, let's let david's words speak for themselves, ok? > I think that's an incredibly generous interpretation, > given his prior history. My interpretation is that > he was digging up anything to discredit those who > disagrees with. The lack of constructive comments > that could improve the system seem to > back up my interpretation.
first of all, david, it's _laughable_ that you seem to believe that the misperceptions of me that you have perpetrated in the past are my "prior history". wake up, mean man... it was clear all along to everyone who was reading along that i do good solid work and have a handle on the truth, while my detractors attained nothing above namecalling. and given the long record at this time, there is no doubt. yet you seem to think you can still sneak in an insult. ha! i suggest you think how the archives will look in 10 years, or 20, or 30, when people have 20/20 hindsight to know that i was right on all these points, in a way that will seem _obvious_ to them at that time. they're gonna wonder both how you could be so stupid as to not see what was obvious, but also how you could be so mean in making your attacks... heck, anyone can be wrong. but to be wrong _and_ haughty? well, _that_ takes a special kind of person, david. real special. *** and this part is really classic as well: > he was digging up anything to > discredit those who disagrees with. let's look past your tortured sentence syntax, david. you seem to think we're still engaged in "disagreement", that all of this controversy is a difference of _opinion_... and it's true that i spun the game that way for a long time, to entice you poop-heads into throwing all your credibility into the poker pot so that when i revealed my winning hand, i would walk away with it all, and you would have nothing left. but the cards are showing now, david... we're no longer in the land of _opinion_. we're dealing with _facts_ now. we're dealing with _data_. we're dealing with _the_truth_. it's no longer sufficient to say "i don't agree with that", not unless your version of the story can also explain the data. and your insults fall _far_ short...
*** but this part is the best: > The lack of constructive comments > that could improve the system i would assume almost everyone else has been able to determine the constructive suggestions from my posts, but here are some, for the record, for the logic-impaired... 1. familiarize yourself with the book before you scan it. review how many numbered pages it has, whether it has any unnumbered pages located in the body of the book, and what frontmatter pages you'll want to be scanning... give it a working title -- i prefer a 5-character string -- e.g., "booke", and scan the first _numbered_ page, with a filename of "bookep001.png" (or .jpg, or whatever)... the o.c.r. program will increment the name automatically, so continue with the numbered pages, skipping over any unnumbered pages. since filenames match pagenumbers, you'll know right away if you accidentally skip or duplicate. once you're done with the numbered pages, go back and scan unnumbered pages, manually naming them correctly, such that they'll sort to their proper place in the directory. then scan the frontmatter, starting with "bookef001.png". 2. make good scans. straighten them, and crop them, if you really want to feel like you have done a good job. but most of all, just make sure you don't screw 'em up. people have to look at the scans, but not all that much, so as long as they're reasonably decent, good enough. (if you want to see a _great_ job of making a scan-set, download any scan-set from _nicholas_holdson_ that you can find in the internet archive. nicholas has done over _400_ books by himself -- from start to finish -- and his scan-sets are a product of breathtaking beauty.) 3. do the o.c.r. using a good o.c.r. program, like abbyy. although it should go without saying, don't use a _dog_ like tesseract, which will make hundreds more scannos... 4. do some initial preprocessing with the explicit goal of determining any pages that returned especially bad o.c.r. 
a good rule of thumb: if over half the lines have an error, re-scan that page. even if a page _looks_ perfectly good, the important thing is the o.c.r. that it returns. and often, a careful re-scan will produce better o.c.r. but if a page still doesn't return good o.c.r. even after it's re-scanned, try manipulating its flaws away using an image program. 5. if there are more than a few pages left with bad o.c.r., scuttle the project. find a better copy of the book to scan, or wait until you learn the trick of how to fix the images... 6. once you've got acceptable o.c.r. from all your pages, then you can begin doing the full-on preprocessing run. some preprocessing routines operate at the word-level, some at the paragraph level, some at the page-level, and some at the chapter-level. so one thing you've got to do is to make sure that all of the chapter headings are right, and your preprocessing tool is recognizing 'em correctly. but that's pretty easy. a little bit harder is to ensure that the _paragraphs_ were recognized by the o.c.r. correctly. a good tool will help you quickly find paragraph glitches, where perhaps the o.c.r. inserted excessive blank lines... what also pops out at this time are the o.c.r. errors where the paragraph-terminating character was misrecognized. 7. now you will continue with the full-on preprocessing. d.p. has enough knowledge to do _great_ preprocessing. that knowledge is summarized in the regular expressions now in gutcheck, and those collected by dkretz and rfrank. put that knowledge to use and save unnecessary proofing. if you claim that you do not have the expertise to develop a good preprocessing tool, i call _bullshit_, and advise you to come here to this listserve where i will help you create it. but as i just said, you've got more than enough knowledge. 
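to make point 7 concrete, here's the shape such a preprocessing battery takes -- a handful of gutcheck-style regular-expression checks (the patterns below are my own illustrative paraphrase, not lifted from gutcheck, dkretz, or rfrank):

```python
import re

# each check is (name, pattern); a hit flags the line for review
CHECKS = [
    ("space before punctuation", re.compile(r"\w [,.;:!?]")),
    ("digit glued to letters",   re.compile(r"[A-Za-z]\d|\d[A-Za-z]")),
    ("spacey quote",             re.compile(r'\s"\s')),
    ("stray garbage character",  re.compile(r"[|^~\\]")),
]

def preprocessing_report(page_text):
    """return (line_number, check_name, line) for each flagged line"""
    hits = []
    for lineno, line in enumerate(page_text.split("\n"), start=1):
        for name, pattern in CHECKS:
            if pattern.search(line):
                hits.append((lineno, name, line))
    return hits
```

the "~" in the chapter-head garble quoted in the earlier message, for instance, trips the garbage-character check, and each extra pattern you add is one more class of error the proofers never have to hunt for by eye.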
oh, and this should go without saying (but obviously not), but do _not_ make errors during the preprocessing which "inject" hundreds, even thousands, of errors into the text, such as the mistake one preprocessor made when he did a global replace changing the em-dashes to en-dashes... 8. specific to your workflow at distributed proofreaders, do _not_ have the proofers rejoin end-of-line hyphenates. if that must be done -- and, really, it _should_not_ be done unless an end-user wants it (and if so, it can be done then) -- then do it _after_ the proofing has been accomplished, by writing computer-routines to handle it. if you need help writing those routines, bring the discussion to this listserve, and i will provide help. retaining linebreaks aids proofing... once you stop rejoining hyphenates, the back-and-forth over _asterisks_ on the hyphenates will come to a merciful stop. (it's extremely amusing -- provided one isn't astounded -- to see the meaningless changes as they accumulate across the rounds: the first proofer will rejoin the hyphenate, and then the next one will add in an asterisk, and then the third takes out the asterisk, and the fourth puts it back in... silly!) 9. again specific to d.p. workflow, adopt a sensible policy on ellipses. i've detailed such a policy here, the one which i follow, which is to change all 4-dot ellipses to 3-dot ones. (since what's the ultimate difference? that's what i thought.) i also delete any spaces between the dots, because you don't want to have a linebreak introduced between them on rewrap. i also "de-float" ellipses, attaching 'em to the preceding word unless in the p-book they were clearly attached _only_ to the _following_ word, in which case i do that instead. (this latter case is what happens to an ellipse at the beginning of a line.) i won't bother listing all the other unnecessary bureaucratic changes required by d.p. workflow, but get rid of all of them. 10. again specific to d.p. workflow, empower your proofers.
trust them to make good decisions, and have them execute. "making a note" about each correction that should be made is a big waste of time and energy, because the next person just has to read the note, process its contents, act upon it to make the change, and erase the note, a bunch of work that _can_ be avoided, and _should_ be. the post-proofing review will thoroughly list all the changes that were made, and any of 'em that were done in error can be reversed then. by the way, you would have picked up on this particular hint if y'all had considered duguid's "first monday" article carefully, since he discussed this very type of excessive communication. 11. do a review of every change that was made to every page. this check shouldn't re-proof the page, just verify the change. 12. move to a roundless system. you can even use the current architecture, just by making two simple changes. first, pretend you "officially" have a 10-round system. then, when a project moves to a new round, every page which has been a "no-diff" on the previous 2 rounds is marked as "done", which means that only pages with a recent diff will be proofed in that round. *** anyway, there's 12 points of constructive criticism to chew on. there'll be more to come, when i start discussing _formatting_, but since we've only discussed _proofing_ so far, i'll stop there. -bowerbird
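the rejoining routine that point 8 above asks for is short enough to sketch (a naive illustration: a real one would consult a word list so genuine compounds like "well-selected" keep their hyphen):

```python
def rejoin_hyphenates(text):
    """merge words split across a linebreak by a single hyphen;
    em-dashes ("--") at line-end and blank next lines are left alone.
    naive on purpose: no word-list check for real hyphenated compounds."""
    lines = text.split("\n")
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        while (line.endswith("-") and not line.endswith("--")
               and i + 1 < len(lines) and lines[i + 1].strip()):
            first, _, rest = lines[i + 1].lstrip().partition(" ")
            line = line[:-1] + first   # glue the broken halves back together
            if rest:
                lines[i + 1] = rest    # leave the remainder on its own line
            else:
                del lines[i + 1]       # the whole next line was consumed
        out.append(line)
        i += 1
    return "\n".join(out)
```

note it deliberately skips line-end em-dashes and never merges across a blank line, so paragraph breaks survive intact.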
From jmdyck at ibiblio.org Tue Apr 22 13:19:47 2008 From: jmdyck at ibiblio.org (Michael Dyck) Date: Tue, 22 Apr 2008 13:19:47 -0700 Subject: [gutvol-d] reaching 25k (The PG Monthly Newsletter, April 21, 2008) In-Reply-To: References: Message-ID: <480E4863.5060205@ibiblio.org> Yesterday, on the gweekly list, Michael Hart wrote: > As of today, April 21, 2008, original "Project Gutenberg > eBook" site totals have reached 25,000, having passed on > from 24,998 in the period since yesterday's totals for a > new total of 25,004. At the time, the highest-numbered text posted was #25119, suggesting that 115 lower numbers are not currently assigned to eBooks. But back on Jan 23rd (in the last weekly newsletter edited by Mike Cook) http://www.gutenberg.org/newsletter/archive/PGWeekly_2008_01_23.txt that quantity ("reserved/pending") was only 47. (Highest ebook number was #24405, count was 24,358.) It seems odd that PG would have reserved 68 more numbers since then. In fact, looking at the 'posted' list, I don't think any more numbers have been reserved. If anything, some formerly reserved numbers might now be in use. So the reserved count is still somewhere in the mid-forties (with small transient fluctuations). I'd say the count hit 25k sometime on April 11 and is now about 25,090. -Michael From Bowerbird at aol.com Tue Apr 22 13:24:41 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Apr 2008 16:24:41 EDT Subject: [gutvol-d] PG ebooks on OLPC XO Message-ID: maybe everyone missed this post earlier... > Hi! > > I am planning on writing a Python program to read PG ebooks, for > eventual use on the OLPC XO laptop, and for use on Macintosh > computers (two platforms that use Python natively for GUI programs).
>
> What I want the program to do is to read PG ebooks in ascii form,
> interpret them semantically and then display semantic elements
> according to either the user's preferences, or the requirements
> of the platform. What I mean by semantic interpretation is that
> the program would read the book and know what the parts are
> conceptually (chapter, author, title, description, illustration, footnote),
> not just what they might look like in the original, visual-appearance-wise.
> This will allow the program to include GUI elements that are generally
> not available to web-browsers and other text readers, including most
> HTML representations: for example, it would allow margin notes to
> appear at the margin, footnotes to appear at the bottom of the page
> that the user sees (regardless of font size) and things like that.
>
> Anyone want to help me do this, or have suggestions? The project would
> be open-source (probably BSD-type license or public domain, not sure yet).
>
> The first step in this project for me will be to define a grammar that
> can be used to build a parser (I don't want to do this heuristically).
> I'm thinking that a good starting point will be to create a formal grammar,
> if one does not already exist, that describes the formatting standards
> on the pgdp.net website (http://www.pgdp.net/c/faq/document.php).
> From reading the pgdp.net website, it seems that a good number of
> PG books are being submitted now with that format.
>
> Does anyone know if other well-defined formatting standards exist
> within the PG library? Does anyone want to guess about the percentage
> of PG ebooks that might validly comply with some kind of formal grammar
> that could be precisely defined (I'm thinking that several grammars could
> be used if necessary)?
>
> I'm fully aware that this formatting issue has come up before in these
> mailing lists, and that, as a whole, the PG library is a bit of a mess
> formatting-wise, but I want to give it a shot anyway.
>
> Any thoughts out there? Anyone want to take the plunge with me?

doesn't anyone have anything to say to this person?

-bowerbird

From ajhaines at shaw.ca Tue Apr 22 13:42:05 2008
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Tue, 22 Apr 2008 13:42:05 -0700
Subject: [gutvol-d] PG ebooks on OLPC XO
References: Message-ID: <001c01c8a4b9$5395a440$6501a8c0@ahainesp2400>

Ralf Stephan responded on April 5.

------------------------------------------------------------------------------
_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d

From Bowerbird at aol.com Tue Apr 22 14:28:44 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 22 Apr 2008 17:28:44 EDT
Subject: [gutvol-d] PG ebooks on OLPC XO
Message-ID:

al said:
> Ralf Stephan responded on April 5.

and ralf completely misunderstood the point of the questions.

-bowerbird

From hart at pglaf.org Tue Apr 22 20:01:56 2008
From: hart at pglaf.org (Michael Hart)
Date: Tue, 22 Apr 2008 20:01:56 -0700 (PDT)
Subject: [gutvol-d] reaching 25k (The PG Monthly Newsletter, April 21, 2008)
In-Reply-To: <480E4863.5060205@ibiblio.org>
References: <480E4863.5060205@ibiblio.org>
Message-ID:

Since I stopped counting every single entry by hand, and then comparing notes with at least one other human count, we have had to rely on an automated count program that is sometimes too high and sometimes too low.

If anyone would care to help do a serious counting effort, I would be only too happy. . . .

Thanks!!!
Michael

From Bowerbird at aol.com Wed Apr 23 01:13:26 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 23 Apr 2008 04:13:26 EDT
Subject: [gutvol-d] reaching 25k (The PG Monthly Newsletter, April 21, 2008)
Message-ID:

my estimate is 15,000 e-texts. as in _books_. or even _pamphlets_. or _declarations_of_independence_.

words. gotta have words. gotta be words. human genome sequences, no. thesaurus absolutely, dictionary wonderful, book of quotes sublime... mp3's, no. not even audio-books, readings of text? no. the sears catalog...was certainly a printed book, so yes. shakespeare...naturally. dante...you betcha. jack london...sure. all the "number" e-texts, no.
that one file which is a giant tarbaby of the entire library up to that time, um, no, and actually i wished you woulda warned me when i innocently asked to download _that_ puppy; make it "by special request only", ok?

it's great that there is _more_ than e-books in this e-library, it's great, but let's count the other media as what they really are, not as "e-books". the atomic bomb videos were a great add to p.g., but they ain't "books"...

and do not get me started on repeats. no, don't get me started on repeats. every book in the bible as a separate e-text. and then the whole bible. every story in the collection. yes, and then the collection as a collection. and yeah, i know the historical reasons, but it's time to zip up that past.

and the thing is, when you decide all you need to keep is the collection, then you keep _1_ e-text and discard a _bunch_. as in 66 in the bible... and that ratio drops your total-number down in fast dramatic fashion.

so yeah, i say 15,000. maybe. and if it's more or less, i guess less...

-bowerbird

p.s. your numbers are now dwarfed by google anyway, so forget that... "it's not the size of the dog in the fight, it's the size of the fight in the dog."

From Bowerbird at aol.com Wed Apr 23 02:55:01 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 23 Apr 2008 05:55:01 EDT
Subject: [gutvol-d] parallel -- the plunderer -- 02 -- another upbeat post
Message-ID:

ok, well i spent another hour on "the plunderer", the "latest" parallel experiment from the d.p. labs.
you might remember this one had clean scans and abbyy o.c.r., plus had a fair dose of preprocessing, which -- along with a whiff of my clean-up tool -- means that two hours is about all the time it needs. i tried to hold back, i really did, but i just couldn't.

roughly 106 corrections in the 318 pages, so 2/3 of the pages were clean from the get-go. so it's done now. at least done enough to go to the public for continuous proofing... that's projected as less than 1 error every 10 pages, meaning 32 errors or less on this 318-page book...

floating single-quotemarks haven't been fixed yet, so those don't count. and i haven't formatted yet, but i saw no italics in pages 1-106, the first third... blocks need indentation, but everything other than that should be right if'n i played all the cards right. so have at it, kids, see if you can catch me with 33...

i've uploaded the thing to my website:
> http://z-m-l.com/go/plund/plund.zml

you can also look at it page-by-page if you like:
> http://z-m-l.com/go/plund/plundp001.html
> http://z-m-l.com/go/plund/plundp034.html
> http://z-m-l.com/go/plund/plundp123.html
> http://z-m-l.com/go/plund/plundp234.html
> http://z-m-l.com/go/plund/plundp303.html

if you'd like your error-reports time-stamped, just do 'em with the web-form on the appropriate page, or send 'em to this list, where i receive them happily.

-bowerbird

p.s. this is how fast books could be getting digitized, provided you simply do each of the steps competently.

From joshua at hutchinson.net Wed Apr 23 06:19:01 2008
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Wed, 23 Apr 2008 13:19:01 +0000 (GMT)
Subject: [gutvol-d] reaching 25k (The PG Monthly Newsletter, April 21, 2008)
Message-ID: <533090630.262031208956741803.JavaMail.mail@webmail04>

Maybe it's time to stop the "reserved number" practice. It's just gotten too hard to keep track of where we stand. Then again, maybe not. Just an idea.

Josh

***

On Apr 22, 2008, hart at pglaf.org wrote:

Since I stopped counting every single entry by hand, and then comparing notes with at least one other human count, we have had to rely on an automated count program that is sometimes too high and sometimes too low.

If anyone would care to help do a serious counting effort, I would be only too happy. . . .

Thanks!!!

Michael

From joshua at hutchinson.net Wed Apr 23 06:25:03 2008
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Wed, 23 Apr 2008 13:25:03 +0000 (GMT)
Subject: [gutvol-d] reaching 25k (The PG Monthly Newsletter, April 21, 2008)
Message-ID: <1879818285.262881208957103616.JavaMail.mail@webmail04>

On Apr 23, 2008, Bowerbird at aol.com wrote:

not even audio-books, readings of text? no.

***

In my library, we use an "item" count. Anything in the collection is included: book, DVD, CD, audio tape, and yes, even MP3 audio texts and players. So, if you view PG as a library, then all those things count in the "grand total".

Josh

PS I'm the Library Board President, so I'm not just making this up. I get monthly reports of this type of stuff.
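[The counting dispute in this thread comes down to one subtraction: the reserved/pending count is the highest posted ebook number minus the number of texts actually posted. A minimal sketch using the figures quoted above (the function name is ours, for illustration):]

```python
# Reserved/pending ebook numbers: the highest posted number minus
# the count of texts actually posted (figures quoted in this thread).

def reserved_numbers(highest_posted: int, posted_count: int) -> int:
    """How many numbers at or below highest_posted have no posted text."""
    return highest_posted - posted_count

print(reserved_numbers(25119, 25004))  # April 21 figures -> 115
print(reserved_numbers(24405, 24358))  # January 23 figures -> 47
```

[The jump from 47 to 115 is what made Michael Dyck suspect the automated count, rather than the reserved pool, had drifted.]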
From Bowerbird at aol.com Wed Apr 23 10:13:09 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 23 Apr 2008 13:13:09 EDT
Subject: [gutvol-d] the best thing i've read this year
Message-ID:

this is the best thing i've read this year:
> http://indiekindle.blogspot.com/2008/03/feedback-is-filter-who-will-distinguish.html

a very interesting topic, an excellent grasp of the relevant relationships, and some intensely beautiful language, all come together in brilliance...

-bowerbird

From Bowerbird at aol.com Wed Apr 23 11:18:11 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 23 Apr 2008 14:18:11 EDT
Subject: [gutvol-d] PG ebooks on OLPC XO
Message-ID:

legutierr said:
> I am planning on writing a Python program to read PG ebooks,
> for eventual use on the OLPC XO laptop,
> and for use on Macintosh computers

great!

> What I want the program to do is to read PG ebooks in ascii form,
> interpret them semantically and then display semantic elements
> according to either the user's preferences,
> or the requirements of the platform.

great!

> The first step in this project for me will be to define a grammar
> that can be used to build a parser (I don't want to do this heuristically).

one question: have you ever written an e-book program before? if not, then i'd suggest you'll want to make _that_ "your first step"...

and that tumbles out a torrent of questions, including: have you written any kind of substantial program before? have you written anything for the o.l.p.c. machine itself? do you own an o.l.p.c. machine?
> I'm thinking that a good starting point will be to create a formal grammar,
> if one does not already exist, that describes the formatting standards
> on the DP website (http://www.pgdp.net/c/faq/document.php).
> From reading the DP website, it seems that a good number of PG books
> are being submitted now with that format.

"in theory, there's no difference between theory and practice. in practice, there is."

if you look at the actual books, you will see that they're still wildly inconsistent, not just from each other, but from the "official" formatting "standards" as well...

> Does anyone know if other well-defined
> formatting standards exist within the PG library?

what nobody here wants to tell you is that i've been working on your idea for years now. and for years now, a whole bunch of people here were noisily telling me it was _impossible_ to do...

at least they _were_ telling me that, very insistently, until i basically presented enough evidence that they finally realized that they were wrong, and i was right... then they fell strangely silent...

in a nutshell, they lost all their credibility on this very issue, so you can understand why nobody has replied to your post. it's kind of a sensitive subject. and, with just one post to this listserve -- your very first -- you've already proven yourself smarter than the nay-sayers.

> Does anyone want to guess about the percentage of PG ebooks
> that might validly comply with some kind of formal grammar that
> could be precisely defined

with no modification at all, under 5%. with modification, over 95%. or more. perhaps up to and even over 99%.

as for the modification required... _most_ of it can be automated, providing you know the inconsistencies that need to be changed... i'd been poking around the p.g. e-texts for several years beforehand, so it only took another year or so of concentrated focus to get the rest. (and of course, one can never be confident that one really knows it all.)
the biggest thing that i haven't (yet) figured out how to auto-format is the frontmatter section, most specifically titlepages... (contents pages are easier to fix, because there is a singleness of purpose about them.) even titlepages would be easy enough to auto-format, _except_ that there's often a potpourri of text and it's difficult to know what to cull... my modus operandi is to do a few thousand of them by hand, and then apply the lessons i've absorbed to write a tool that does it automatically. but i haven't been able to force myself to do more than a few hundred... so that's where my project is stalled at the moment... it only takes a minute or so to fix each one. but with 15,000 e-texts... eventually, i'll probably just give up on it, chuck the existing titlepage, and create a new titlepage based upon the information in the catalog... of course, i'll save the old titlepage text, so when i have the library in a wiki format, which is the next part of the plan, people can _restore_ it... > I'm fully aware that the formatting issue is a pretty big one, and that, > as a whole, the PG library is a bit of a mess formatting-wise, > but I want to give it a shot anyway. have at it. i'm willing to give you as much (or as little) help as you need. or, if you want to help me instead, you know where my project is stalled. (and you can also feel free to offer help in any other way, if you'd prefer.) > Any thoughts out there? Anyone want to take the plunge with me? look at what's already been done first. the best place to start will be gutenmark. then look at the gutcheck tools. (although, since they're used to clean the e-texts before they get posted, you'll discover that most of the e-texts are free of the inconsistencies that the gutcheck tools discover.) 
and finally, look at work i've done on z.m.l.:
> http://z-m-l.com
> http://www.z-m-l.com/go/vl11.zml
> http://www.z-m-l.com/go/test-suite.zml
> http://z-m-l.com/go/pudding_sampler.html

-bowerbird

From legutierr at hotmail.com Wed Apr 23 14:58:19 2008
From: legutierr at hotmail.com (L Gutierr)
Date: Wed, 23 Apr 2008 17:58:19 -0400
Subject: [gutvol-d] PG ebooks on OLPC XO
In-Reply-To: References: Message-ID:

In response to Bowerbird:

> one question: have you ever written an e-book program before?
> if not, then i'd suggest you'll want to make _that_ "your first step"...

I would say that building an e-book program is actually my "last step", in that it is my objective.

> and that tumbles out a torrent of questions, including:
> have you written any kind of substantial program before?

Yes, certainly.

> have you written anything for the o.l.p.c. machine itself?

Yes.

> do you own an o.l.p.c. machine?

Yes.

>> Does anyone want to guess about the percentage of PG ebooks
>> that might validly comply with some kind of formal grammar that
>> could be precisely defined
>
> with no modification at all, under 5%. with modification, over 95%.
>
> or more. perhaps up to and even over 99%.
>
> as for the modification required... _most_ of it can be automated,
> providing you know the inconsistencies that need to be changed...

Are the numbers you are providing an estimate, or do you have more precise data? If you could make precise data available, that would certainly be useful. If you do have precise data, it would also be useful to know how you have derived that data. And, just so I'm clear, what do you mean by "modification" and "automated modification"?
> look at what's already been done first.
>
> the best place to start will be gutenmark. then look at the gutcheck tools.
> (although, since they're used to clean the e-texts before they get posted,
> you'll discover that most of the e-texts are free of the inconsistencies that
> the gutcheck tools discover.) and finally, look at work i've done on z.m.l.:
>> http://z-m-l.com
>> http://www.z-m-l.com/go/vl11.zml
>> http://www.z-m-l.com/go/test-suite.zml
>> http://z-m-l.com/go/pudding_sampler.html

I have already looked through the gutenmark code and it is not appropriate for this project, in that it does not do anything like semantic analysis. What gutenmark does is it looks at a segment of text and decides how it should be formatted based on the text that surrounds it; it does not try to interpret how that text fits into the semantic structure of the document, nor does it look like any adaptation of gutenmark short of a rewrite would be able to perform such analysis.

I've read a bit about gutcheck, and it seems to be a punctuation and spell-checking program. It seems to be looking at the contents of chapters, not the structure of each book.

I've also looked at your zml pages. Although your zml does seem to provide a consistent set of formatting rules that can be the basis of semantic analysis, I suspect that only a small portion of the ascii texts within the PG library are formatted in parsable zml. Do you know what portion of PG ebooks are in valid zml format?

My objective is to accurately interpret a large majority of ebooks right out of the starting gate, without manual reformatting. The existence of other wide-spread conventions or standards, if consistently followed across a significant enough minority of ebooks, would allow that objective to be reached.

I was also wondering whether the source of your zml parsers, etc., is available? The other programs that you mentioned are open-source, I believe.

Thanks!
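[One low-risk first step for the parser discussed here, independent of any DP formatting grammar: most PG plain-text files bracket the book body with "*** START OF ..." and "*** END OF ..." marker lines. A minimal sketch, with the caveats that the marker wording has varied over the library's history (hence the loose pattern), older e-texts may lack markers entirely, and the function name is ours:]

```python
import re

# Loose match for the "*** START OF ..." / "*** END OF ..." marker lines
# that bracket the book body in most PG plain-text files. The exact
# wording has varied over the years, so the pattern is permissive.
MARKER = re.compile(r"^\*\*\*\s*(START|END) OF .*$", re.MULTILINE)

def split_pg_text(text):
    """Return (header, body, footer), or None if the markers are absent."""
    hits = list(MARKER.finditer(text))
    if len(hits) < 2:
        return None  # very old e-texts predate the marker convention
    first, last = hits[0], hits[-1]
    return text[:first.start()], text[first.end():last.start()], text[last.end():]
```

[Splitting on the markers first means the semantic grammar only ever has to parse the book body, not the license boilerplate around it.]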
From Bowerbird at aol.com Wed Apr 23 20:28:13 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 23 Apr 2008 23:28:13 EDT
Subject: [gutvol-d] PG ebooks on OLPC XO
Message-ID:

i said:
> > have you written any kind of substantial program before?

legutierr said:
> Yes, certainly.

great! (not everyone has, you know.) :+)

> I would say that building an e-book program
> is actually my "last step", in that it is my objective.

well, it's where you want to end up, of course. :+) but it helps to know what you need to have to get there, in order to successfully plan your route in getting there.

i have written dozens of different e-book programs, and every one of 'em has its own little idiosyncrasies, depending upon the unique angle i'd pursued with it, so the individual choice-points you face are multiple...

those programs also all have a common thread to them, and it is useful in pursuing a path to know about _that_. so don't underestimate the importance of doing a demo. it will teach you things that you didn't know you needed...

the final consideration here is that the world has become apathetic to e-book programs, perhaps outright hostile, so unless you think you can deliver one that has "charm", you should know that that is the audience you're facing...

_i_ love the things, and i'm always thrilled to look at what the next person has designed, so i'd tell you to "go for it". but i nonetheless think you deserve to know the situation.

i said:
> > have you written anything for the o.l.p.c. machine itself?

legutierr said:
> Yes.

even better... because that machine poses certain challenges, so if you've actually put an app on the thing, you're way ahead. and i will ask you for help when i turn to my little green friend.
what did you program for it? is it available for download?

> Are the numbers you are providing an estimate,
> or do you have more precise data?

my numbers are "an estimate". but you can bet they're accurate. i've spent a lot of time probing the library to obtain that estimate.

> If you could make precise data available,
> that would certainly be useful.

well, with numbers as diametric as 5% versus 95%, it shouldn't be hard to tell if i'm out of the ballpark. :+)

> If you do have precise data, it would also be useful to know
> how you have derived that data.

i do research. and i pay (very close) attention to the results. i write routines to resolve the inconsistencies in the library. i see how many changes were made, which ones "worked", and which ones ended up not achieving the desired results. then i iterate from there. and iterate again. again and again.

> And, just so I'm clear, what do you mean by "modification"
> and "automated modification"?

let's take one of the easiest examples. the "official" spec for a header calls for 4 blank lines above, and 2 blank lines below. straightforward enough, agreed? you'll find precious few e-texts -- my guess is under 20% -- where _all_ of the headers have followed that rule _precisely_. the modification is clear enough -- adding or deleting the surrounding blank lines until you get the required number -- and this is a change i'd feel comfortable doing automatically, considering that i have several independent ways to determine whether any particular chunk of text _is_ or _is_not_ a header...

but again, that's an _easy_ example. you'll find that _footnotes_, for instance, aren't nearly as simple. not only were there many different conventions across time, but different people followed their own "instincts" in most e-texts... now add in the extra complications from things that are _like_ footnotes, but not exactly -- such as sidenotes and runheads -- and before you know it, the nightmare has grown enormous...
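[The kind of automated consistency probe being described can be sketched in a few lines: pull out bracketed footnote markers and flag numbers that run backwards. This assumes the "[Footnote N:" convention, which is only one of the many footnote styles in the library, and the function name is ours:]

```python
import re

# Flag "[Footnote N:" markers whose numbers run backwards, one small
# example of the automated consistency checks described above. This
# assumes the bracketed convention; many e-texts use other schemes.
FOOTNOTE = re.compile(r"\[Footnote (\d+):")

def out_of_sequence(text):
    """Return (previous, current) pairs where the numbering jumps backwards."""
    nums = [int(m.group(1)) for m in FOOTNOTE.finditer(text)]
    return [(a, b) for a, b in zip(nums, nums[1:]) if b < a]
```

[Note that a malformed marker like "[Footnote: 1" simply fails to match, so a robust probe would also count markers and compare against an expected total.]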
and, as if all that weren't enough, there has been a certain degree of -- how shall we put this? -- _casual_attention_to_the_details_...

to see a concrete example, download a copy of p.g. e-text #13603, "the hawaiian romance of laieikawai". then pull out all the lines that contain "[footnote", as that is how the footnotes were indicated there. here's part of what you'll get:

> [Footnote 21: Compare Gill's story of the first god, Watea, who dreams
> [Footnote 22: In the song the girl is likened to the lovely _lehua_,
> [Footnote 23: No other intoxicating liquor save _awa_ was known to the
> [Footnote 21: In the Hawaiian form of checkers, called _konane_, the
> [Footnote 25: The _malo_ is a loin cloth 3 or 4 yards long and a foot
> [Footnote 28: In Hawaiian warfare, the biggest boaster was the best man,
> [Footnote 27: The idiomatic passages "_aohe puko momona o Kohala_,"
> [Footnote 28: This boast of downing an antagonist with a single blow is
> [Footnote 29: Shaking hands was of foreign introduction and marks one of
>
> [Footnote 71: In mythical quest stories the hero or heroine seeks, by
> [Footnote 72: According to the old Polynesian system of age groups, the
> [Footnote 73: The name Laukieleula means "Red-kiele-leaf." The kiele,
> [Footnote 74: The story of the slaying of Halulu in the legend of
> [Footnote 75: The divine approach marked by thunder and lightning,
> [Footnote 75: Kaonohiokala, Mr. Emerson tells me, is the name of one of
>
> [Footnote 1: Compare the fishhook Pahuhu in _Nihoalaki_; the _leho_
> [Footnote 1: Compare _Kalelealuaka_.]
> [Footnote: 1 This means literally "to travel over land and sea." (See
> [Footnote 1: This is not the Olopana of Hawaii.]
> [Footnote 1: This is only a fragment of the very popular story of the
> [Footnote 2: Rev. A.O. Forbes's version of this story is printed in
> [Footnote 1: See Daggett's account, who places Moikeha's role in the
> [Footnote 1: Kaulu meets the wizard Makalii in rat form and kills him by
> [Footnote 3: Daggett tells the story of _Hua_, priest of Maui.]

right there, in plain view, you can see several numbers out of sequence, and one malformed instance. if something as direct and dirt-simple as a sequence of numbers has bugs in it, how much other garbage is there?

this is what you have to deal with. believe me, it's not easy. it's _doable_. and it's not even difficult, if you have enough determination and tenacity. but that doesn't mean it's easy.

> I have already looked through the gutenmark code
> and it is not appropriate for this project,
> in that it does not do anything like semantic analysis.

i have two basic reactions to that.

the first is to say "oh-oh, that word 'semantic' is a loaded one." for some people, it means "artificial intelligence", and that can make me run screaming from a room. if that's what you mean, all i can do is wish you luck and pray for your soul... :+)

the second reaction is to say "i didn't mean the gutenmark _code_, i meant all the _difficulties_ ron burkey describes on the site which led to his eventual abandonment of the project, which can probably be summarized in the single word 'inconsistencies' for this thread..."

ron is just one of many programmers who ran screaming from the lobby of the project gutenberg library because of _inconsistencies_.

> I've read a bit about gutcheck, and it seems to be
> a punctuation and spell-checking program.

same with gutcheck. if you look at its innards, you will find regular expressions that detect the types of inconsistencies that still plague much of the old contents of the p.g. library.

> I've also looked at your zml pages.
Although your zml > does seem to provide a consistent set of formatting rules > that can be the basis of semantic analysis, I suspect that > only a small portion of the ascii texts within the PG library > are formatted in parsable zml. as i said, my estimate is 5%. so in one sense, you are correct. i also said that, with the modification programs i've written, i can make 95% of the library into perfectly clean z.m.l. text. so -- in that more-important sense -- you are _incorrect_... > My objective is to accurately interpret a large majority of ebooks > right out of the starting gate, without manual reformatting. i wasn't able to do it. and my guess is that no one else can either. but i'm not gonna risk any of _my_ credibility trying to tell you that you can't do it. because maybe you're the person who _can_ do it, in which case i'll be behind you every step, cheering you on loudly. but if you expect that things will just fall out of the pile correctly, it will not take you very long to see that that will not be the case... that doesn't mean you'll have to do "manual" reformatting, though. right away, you'll see ways you can write routines to do it for you, and if you begin, and persist, and persist some more, and more, then you will be doing what i've done for the past couple years... and, in the end, you too will be able to use your correction routines to transform the inconsistent p.g. library into a consistent version... but, in the end, you won't be able to do anything that i can't do, since you and i will both be mounting "mirrors" of the p.g. library which allow the structure of the e-texts to be machine-processed. (that'd be great, since our libraries could cross-check each other.) or, on the other hand, you can just wait until i release my mirror... then you'll have a copy of the library that you can machine-process. 
of course, at that time, so will everyone else, which means that your viewer-program will have to compete against my viewer-program and everyone else's viewer-programs, which takes us back to my first suggestion to you, which is that you write your program first... by the way, here's the general spec i used for my viewer-app: > http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2004&post=2004-01-08,3 that post is over 4 years old, but the feature-set is still cutting-edge. make a viewer-app that can do all that, and you will have made one that is as good as mine. make one that can do _more_ than all that, and you will have made one that is _better_ than mine, so hooray! > The existence of other wide-spread conventions or standards, > if consistently followed across a significant enough minority > of ebooks, would allow that objective to be reached. the magic phrase is "if consistently followed". and you will very quickly conclude that it is simply _tragic_ that the whitewashers haven't been smart enough to see that wisdom. because the library now could be _immensely_ more powerful if they had seen the value of consistency, and brought it about. > I was also wondering whether > the source of your zml parsers, etc., is available? my source is not available, no. you can buy it, but the pricetag is 6-figure. but if you want to build some open-source programs that support z.m.l., i will be most happy to give you my full (and free) assistance in doing that. > Thanks! sure thing... my pleasure... :+) -bowerbird
************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851)
-------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080423/2dfce38a/attachment-0001.htm From Bowerbird at aol.com Thu Apr 24 12:50:52 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 24 Apr 2008 15:50:52 EDT Subject: [gutvol-d] digitization filenaming conventions, circa 2008 Message-ID: over at distributed proofreaders, some time back, someone said: > > If one of the points of this naming scheme is > > to present things in the same order as the physical book, > > why does the rear cover of the book get a "c" prefix? > > Shouldn't it be included in "r" to be in the proper place? and donovan replied: > ...then it's a whole lot of extra effort for no real gain > since 000-through-whatever already achieves that goal. > Proponents of the various other file naming schemes > want the filenames to indicate something "significant" > about the actual content on the page image, > and conveniently ignore the sorting calisthenics > they have to go through to get these "value-added" filenames > to come out in the same order as the actual physical book. this point needs to be addressed, because it's so dead-on wrong. and let's start with the last point first: if you name your files right, their _natural_ sort-order automatically arranges the files exactly as pages appeared, "in the same order as the actual physical book". no "calisthenics" required... i am a proponent that filenames "indicate something significant" about the content residing within, specifically _its_page-number_. and -- obviously -- that's because the page-number _is_ significant. the page-number is significant because that is what people have used for centuries now when we make reference to a piece of a book's text. the page-number is the dead-tree-book equivalent of a _hyperlink_. it has a long, glorious history in our paper-archives, and we would be totally and completely remiss to do things that disturb those "links"... no sir, we need to _maintain_ those links. 
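(the natural-sort claim is easy to check; here's a minimal python sketch, with hypothetical filenames modeled on the z-m-l.com links in this thread:)

```python
# hypothetical filenames modeled on the z-m-l.com links in this thread:
# zero-pad the page number, and a plain lexicographic sort IS book order.
pages = [f"plundp{n:03d}.png" for n in (2, 11, 111, 305)]
assert sorted(pages) == pages  # no "calisthenics" required

# without padding, lexicographic order breaks book order:
unpadded = ["p2.png", "p11.png", "p111.png", "p305.png"]
assert sorted(unpadded) != unpadded  # "p11.png" sorts before "p2.png"
```

(the padding width just has to cover the book's page count -- 3 digits is plenty for a 318-page book.)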
it is vitally important, and if your e-library doesn't do it, your e-library will crumble into chaos. especially in the transition period as we move from paper to digital, since it will be important during that time to be able to synchronize. so, in a nutshell, donovan has things bass-ackwards... we go through some slight "calisthenics" to create meaningful names for the _huge_future_benefit_ of making use of that meaningfulness. (if you think about it, this is why names have utility in the first place; they're a shortcut that enables us to know what we're talking about.) having files "reflect the order of the original paper-book" is _one_ goal, sure. but it's not the _only_ goal. it's not even the most important goal. the most important goal is to know how to ask for exactly what we want. when i link to rfrank's parallel-proofing experiment like this: > http://z-m-l.com/go/plund/plundp011.html > http://z-m-l.com/go/plund/plundp111.html > http://z-m-l.com/go/plund/plundp211.html > http://z-m-l.com/go/plund/plundp311.html and you know that my library's filenames reflect the pagenumber, you know you've been directed to pages 11, 111, 211, and 311... so if you had a paper-book, you'd know where those links went. even more valuably, you know how to generate the link to _any_ of the 318 pages in that book, and know it with zero uncertainty, so you can resolve a paper-based reference without ambiguity... synchronization. we can't live without it. but hey, all of this is pretty obvious. indeed, it is _so_ obvious that p.g. has decided that image-scans in its library must be so-named. this means that -- before they can upload their scans to p.g. -- somebody at d.p. has to go through a file-renaming exercise... for a while, it was josh, who apparently was doing the renaming in some kind of manual fashion, until he became too exhausted. so he said he couldn't continue, that it took him too much time, and dkretz stepped in with an offer to create a tool to help out... 
(this was shortly after i had offered my own such tool to people.) and a few months later, dkretz and d.p. now have a good tool that a team of people have been using to rename scans headed to p.g. i think they're still downloading all of the images to rename them, and then uploading them as well, which is a massive waste of time and bandwidth -- much better to rename them right on the server and just have the p.g. computer grab them from there -- but hey, at least they ain't doing the renaming manually any more. hooray! maybe they think the download/upload dance "builds character"... however, there is still a huge weirdness going on. because d.p. itself still uses the old, stupid filenaming convention, for the most part. individual content producers can do what they like, but d.p. itself has made no commitment to the new method... which means the overall situation still has _2_ sets of numbering. and anyone who wants to discuss the process over there at d.p. -- e.g., as i've been doing here when discussing their experiments -- has to constantly juggle the difference to make themselves clear... moreover, the volunteers at d.p. have to use the stupid system... so here's another case where incompetent content providers are making decisions which impact adversely on others downstream. and let me be totally clear on this! even if it didn't make things easier for the rest of the world, d.p. should change its filenaming convention to make things _easier_for_itself_... it _is_ simpler -- much simpler! -- to work on a project where the pagenumber and the filename are one and the same thing. donovan seems to believe that the old system is simpler. perhaps it's simpler for _him_. but not for the _proofers_... every time someone points to page#x, and someone else goes to file#x, only to learn they need to go to file#x+offset, it's a waste of time... oh sure, it's only a mere annoyance when they do not match; it causes no more than a little blip when you hit a mismatch... 
but multiply each of those tiny blips by thousands of projects -- thousands of people, hundreds of thousands of pages -- and realize the cumulative effect is not quite so tiny after all... and when it can be totally eliminated with a simple solution, it should be a no-brainer to eliminate it. not only is this number/name match _simpler_, it's _better_. if you know the name and number are supposed to match, it's obvious that you have a problem when they don't match; either (1) you missed a page, or (2) you duplicated a page... and no matter which it was, when you correct the problem, you will have to rename some of your files midstream if you want to maintain your original structure, which is a big pain, because now the same file will have two different names (before and after), and the same name will have been used by two different files (one that had it before, the other after). these kinds of problems use up a disproportionate amount of time when they aren't discovered until late in the project, and a sensible naming convention gives notice right away... it's also easier to identify the unnumbered illustration plates when their filenames have something that differentiates 'em. so the so-called "sorting calisthenics" don't _cost_ you time, they're an investment that usually means _saving_ you time... so anyone digitizing a book should be using smart filenames. for the entire workflow, from the very start up to the very end. and best of all is when the person _scanning_ does it right... then the files _never_ have a bad name, and the o.c.r. output is named correctly from the get-go, and everyone is happy... yet dkretz -- who seems to have learned he had better not challenge donovan in any way, direct or indirect -- has thus gone out of his way to avoid the suggestion that his tool can be used at the _start_ of the d.p. process, not just at the end. poor dkretz...
he's one of the smartest guys over there, but he feels the need to walk on eggshells to avoid clashing with the powers that be -- as the "leadership" jokingly calls itself. (at least they have some sense of humor. but seriously, you better not cross 'em, or you'll learn it doesn't run too deep.) but the long-term outlook is bright... as more and more of the content providers "learn the secret" of the benefits of the pagenumber/filename match, they will increasingly switch... a good number of them already have, and more will follow. because of this, scripters have recently modified guiguts (one of the main d.p. tools) so that it deals with the new naming style. so -- eventually -- the problem will disappear, as more content providers wise up, perhaps encouraged by proofers who've reaped the benefits... still, it is a sad commentary that the d.p. "leadership" is so blind. -bowerbird
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080424/d4199394/attachment.htm
From Bowerbird at aol.com Thu Apr 24 13:07:05 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 24 Apr 2008 16:07:05 EDT Subject: [gutvol-d] auto-detection of errors Message-ID: i was going to write an extended _series_ of messages on auto-detection techniques -- doing them, using results, etc. -- but decided instead to just do an outline in one message... so i've appended a first-draft, off the top of my head... i might expand on it, or elaborate, in the future... or i might not... ideally, i guess it should be a wiki-page... this ain't rocket-science. just simple analyses of text... feel free to make suggestions, front-channel or back... -bowerbird
==============================================
page-level...
1. runhead corrections
chapter-level...
2. chapter segmentation
2a. ensure proper header structure
2b. garbage around drop-caps
2c. downcase ornamental uppercasing
paragraph-level...
3. paragraphing
3a. ensure proper paragraphing
3b. spacey-quotes
3c. unbalanced quotes
3d. paragraph termination
3e. extraneous blank lines
character-level...
4. garbage character resolution
word-level...
5. low-frequency words not in dictionary
5a. scannos
5b. book-idiosyncratic
5c. lowercased are typically scannos
5d. uppercased are typically names, and correct
6. high-frequency words not in dictionary
6a. almost always correct
6b. names
6c. book-idiosyncratic
6d. dialect/slang
7. examine names
7a. in name dictionary
7b. following title-terms like "mr." or "dr."
7c. capitalized word following comma
7d. capitalized word mid-sentence
7e. consecutive capitalized words
8. missing whitespace
8a. joined.words
8b. runtogetherwords
9. words with mixed alpha/numeric
9a. except ones like 1st, 2nd, 3rd
9b. autochange 1-space-quote to !"
9c. autochange 1 in word to lowercase "l"
9d. autochange 1 solo to capital-i
9e. autochange mid-word capital-i to lowercase "l"
10. examine all-numeric strings
punctuation-level...
11. punctuation improbabilities
11a. space-period-space
11b. space-comma
11c. 1-space-quote = !"
11d. cap-word after comma
11e. period-comma
11f. comma-comma
11g. 4 periods to 3 dots
11h. 2 periods to 3 dots
11i. period-space-period to 2 dots
-------------- next part -------------- An HTML attachment was scrubbed...
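(for illustration only -- a rough python sketch of a few of the punctuation-level checks in that outline; the patterns and names here are my own guesses, not anybody's actual tool:)

```python
import re

# a few "punctuation improbabilities" detectors, named after the outline
# items above; the patterns are my own approximations, not gutcheck's rules.
CHECKS = {
    "space-period-space": re.compile(r" \. "),
    "space-comma":        re.compile(r" ,"),
    "period-comma":       re.compile(r"\.,"),
    "comma-comma":        re.compile(r",,"),
}

def flag_line(line):
    """return the names of all checks that fire on one line of text."""
    return [name for name, rx in CHECKS.items() if rx.search(line)]

def normalize_ellipses(line):
    """items 11g/11h: normalize 4-dot and 2-dot runs to 3 dots."""
    line = re.sub(r"\.{4}", "...", line)               # 4 periods -> 3 dots
    line = re.sub(r"(?<!\.)\.\.(?!\.)", "...", line)   # 2 periods -> 3 dots
    return line
```

(each detector just flags a line for human review; only the ellipsis cases are safe to auto-change, which is why they get their own function.)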
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080424/389a0f79/attachment.htm
From donovan at abs.net Thu Apr 24 18:31:35 2008 From: donovan at abs.net (D Garcia) Date: Thu, 24 Apr 2008 21:31:35 -0400 Subject: [gutvol-d] digitization filenaming conventions, circa 2008 In-Reply-To: References: Message-ID: <200804242131.36114.donovan@abs.net> On Thursday 24 April 2008 15:50, Bowerbird at aol.com wrote: > i am a proponent that filenames "indicate something significant" > about the content residing within, specifically _its_page-number_. > and -- obviously -- that's because the page-number _is_ significant. > the page-number is significant because that is what people have used > for centuries now when we make reference to a piece of a book's text. > no sir, we need to _maintain_ those links. it is vitally important, and > if your e-library doesn't do it, your e-library will crumble into chaos. DP is not an e-library; it is a production process. Although bowerbird seems to be confused about it, it should be obvious that the requirements for a production process can be--and usually are--different from those appropriate to a storage/retrieval/reference environment such as PG. > this means that -- before they can upload their scans to p.g. -- > somebody at d.p. has to go through a file-renaming exercise... Corrected to read: "This means that--before they can upload their page image scans to PG--*everybody* has to go through a file-renaming exercise due to PG's filenaming conventions." > moreover, the volunteers at d.p. have to use the stupid system... Corrected to read: "... can and do, at their option, choose to use either system." > so here's another case where incompetent content providers are > making decisions which impact adversely on others downstream. Corrected to read: "Here's another case where content providers are performing the tasks which benefit the organization they have chosen to volunteer for.
Other volunteers at other organizations are free to perform other tasks which benefit their organization." Bowerbird proposes to pass this renumbering work off to people who don't want to do it, don't know how to do it, or don't care about it, into a system which does not require it to perform its intended function. That's an increased cost (training and code changes, tool changes, etc.) with no benefit. It actually turns out that, in spite of his wild conjectures about "thousands and thousands of blips", the vast majority of DP volunteers are intelligent and flexible enough to effortlessly comprehend filenames unrelated to the page number, in large part because the process does not even require them to be aware of any difference. More to the point: his proposal pushes this work away from people who actually do want to do it and already know how. It would be interesting to hear how that is better than having a self-selected, experienced group of specialists who are already doing this work for both organizations, to their mutual benefit. > still, it is a sad commentary that the d.p. "leadership" is so blind. It's a sad commentary that bowerbird is so frequently disrespectful of almost everyone actually producing e-books. These volunteers have made a conscious choice to actively support their organizations--warts, inefficiencies and all. His passive and extensive diatribes do almost nothing to support DP or PG. I accept that he disagrees with many of the ways in which DP operates, but he's not going to change anyone's mind by alienating through insults the very people he's trying to get his message to, no matter how many times he repeats it. D. Garcia System Administrator www.pgdp.net
From Bowerbird at aol.com Thu Apr 24 21:00:15 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 25 Apr 2008 00:00:15 EDT Subject: [gutvol-d] digitization filenaming conventions, circa 2008 Message-ID: donovan gets routed to my spam folder... and i ain't fishing him out...
:+) but if he makes any good points, i do hope somebody rewrites them in their own words and posts them in a separate message... thanks folks... -bowerbird
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080425/cdb25622/attachment.htm
From rolsch at verizon.net Thu Apr 24 22:34:18 2008 From: rolsch at verizon.net (Roland Schlenker) Date: Fri, 25 Apr 2008 00:34:18 -0500 Subject: [gutvol-d] digitization filenaming conventions, circa 2008 In-Reply-To: References: Message-ID: <200804250134.18932.rolsch@verizon.net> On Friday 25 April 2008 12:00:15 am Bowerbird at aol.com wrote: > donovan gets routed to my spam folder... > > and i ain't fishing him out... :+) Bowerbird, with this post you have finally placed "the straw that broke the camel's back". I once respected and supported your views. Occasionally, I learned something. But, I have finally grown tired of your rants. Good-bye, Roland Schlenker
From Bowerbird at aol.com Thu Apr 24 23:55:01 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 25 Apr 2008 02:55:01 EDT Subject: [gutvol-d] digitization filenaming conventions, circa 2008 Message-ID: roland said: > I once respected and supported your views. really? i don't remember any such "support"... perhaps you could refresh my memory please? because i would certainly hate to be ungracious. > Occasionally, I learned something. as they say, even a stopped clock is right twice a day... :+) > But, I have finally grown tired of your rants. i don't write "rants", roland... it's all cucumber-cool logic, thanks to the magic of my airport-runway retardant-foam. and donovan has always just done evasive maneuvers, so i see no reason to even bother with the latest round of that.
but i sense i've disappointed you by failing to take the bait? of course, if he _had_ made some good points, you certainly had the chance to present them in your own words, but you decided to take a different tack, which i find as meaningful... -bowerbird
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080425/71971b92/attachment.htm
From Bowerbird at aol.com Fri Apr 25 13:02:24 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 25 Apr 2008 16:02:24 EDT Subject: [gutvol-d] republishing versus transcription Message-ID: > "the well-made page is now what it was then: > a window into history, language and the mind: > a map of what is being said and a portrait of > the voice that is silently speaking." > -- robert bringhurst > (from "the elements of typographic style") *** here's a post i wrote a while back, but never sent then. i dug it out to send because it has renewed relevance... *** robert said: > If your doing a study on classic sci-fi it made me chuckle to see "your" there, when you meant "you're". i'll explain why in a little bit... > If your doing a study on classic sci-fi and how it appeared > printed in issue X of generic magazine - shouldn't any errors > included in the original printing be maintained verbatim? robert, if what you are doing is "a study on classic sci-fi", then you should assume the responsibility of doing the digitization. that's why scholars get grants, and have graduate students... but as i understand the mission of project gutenberg, straight from the founder's mouth, we are creating a _library_, for ordinary people to _read_, so we don't want any typos in it. project gutenberg doesn't transcribe books, it republishes them.
> If you're fixing possible spelling mistakes at first, > what about the trend later to fix grammar mistakes there is a fine line here, to be sure, but it's not that hard to draw. and it's one that re-publishers have faced since the dawn of time. a very good source of guidance comes from "the author's intent". > so essentially PG is going to become the editor to fix things > that the original may have honestly put in there intentionally. with the words "the original", you glossed over a difficult question. i consider the _author_ to be the creative entity of importance here, not the _publisher_. so if the author put something in, i'll honor it. but if the publisher put it in, not so much. i am the new publisher... > I have heard that some publishers or authors put in an occasional > mistake on purpose to verify if anyone else copied their work. and we take it out, so no one will think we copied their mistakes. :+) > Granted this is more likely to happen in a public domain anthology, > but at the same time something may come across as intentional. here with the "intentional" stuff, you've stimulated me to tell a story. on march 2nd, my girlfriend and i went to see the los angeles marathon. she pointed out to me a family with home-made posters on their backs saying "your a champ". (that's why i drew attention to your same typo.) i thought it was funny, and took a picture. later that day, at home, working on "planet strappers", this line on page 86 jumped out: > Your a pal. > http://z-m-l.com/go/plans/plansp086.png oops! so this is 1 of the 2 errors that i accidentally found in the book which managed to elude 7 rounds of proofers actively searching for errors. however, i was tipped to that paragraph because it also contained 2 other mistakes that were corrected earlier by various proofers -- "buddy" spelled as "buddie", and "brake" where "break" was proper.
the third p-book error in that one paragraph got me to wondering, however, so i looked closely, and noted this paragraph was described as "a note, pencilled jaggedly", which made me conclude these "errors" were _intentional_ on the part of the author, so i reverted the "fixes". (ironically, "pencilled" just showed up in my e-mail spellchecker, and -- since it's not part of the note -- it probably should be corrected... or maybe not. my p-dictionary says "pencilled" is the british version. but of course, this book hasn't been using any other british spellings. whatever, i'll let al worry about it...) ;+) > What about when Twain writes in a dialect - granted we know what > the words should be, but at the same time any proofreader would > see this as a misspelling. no. you see it as dialect. and you leave it, because twain intended that. *** so, in making the decision about whether the _author_ did something, or whether the _publisher_ did it, it helps if you are familiar with the _types_ of edits that publishers have traditionally enacted upon books. here i am _not_ talking about the changes that an _editor_ will make -- often in conjunction with the author, but sometimes not even -- which involve modifications to the actual way the story was written... rather, i'm referring to changes that a _copy-editor_ would make... (and the copy-editor rarely, if ever, does consultation with an author.) these changes are often at the level of nitty-gritty, usually having to do more with _formatting_ than with the words per se. in this endeavor, the copy-editor is guided by what is called "the house style", which is the style-guide that the publisher has adopted to bring _consistency_ to the books that they publish. you might be familiar with some of the major "style guides", such as the chicago manual of style, or the modern language association...
> http://www.chicagomanualofstyle.org/home.html > http://www.mla.org/ > http://www.bartleby.com/141/index.html > http://www.grammarbook.com/ > http://apastyle.apa.org/ > http://www.calstatela.edu/library/styleman.htm > http://www.libraryspot.com/grammarstyle.htm > http://www.aresearchguide.com/styleguides.html > http://www.ipl.org/div/subject/browse/ref73.00.00/ > http://www.wsu.edu/~brians/errors/ > http://www-personal.umich.edu/~jlawler/aue.html > http://andromeda.rutgers.edu/~jlynch/Writing/ > http://grammar.ccc.commnet.edu/grammar/ > http://webtypography.net/ however, each publishing company will formulate its own style guide, maybe based largely on one of the main ones, but with its own quirks. it is _this_ which determines the formatting of the book when it's done. publishers do _not_ feel bound by "the author's intent" with regard to formatting, as they consider that task to be in the realm of _their_ job. nor are style guides forbidden from working at the word-level either... perhaps the most obvious example is that british publishers often use british spellings, even when the author is an american (e.g., willa cather), the story is set in the u.s. ("my antonia"), and the book is sold in the u.s. also within the purview of the copy-editor is how certain words are spelled, and whether they are hyphenated or not, not just _within_ each book, but across all the books that are published by the house. are movie names italicized? or put in quotes? how about book names? how is a newspaper headline formatted? a magazine excerpt? a sign? wired magazine actually put up a web-page that said "no big deal, but from now on, we're gonna stop capitalizing the word 'internet'". some people take these things seriously. they're paid to do just that. if a character applying for a job sends in his resume, does that word have one diacritic, two, or none? are footnotes indicated by asterisks, or are they numbered? or do you use endnotes instead of footnotes?
what ellipsis style will you use, a mix of 3- and 4-dot, or always 3-dot? are related sentences joined with a semi-colon, or treated separately? do em-dashes routinely set off phrases, or are commas used instead? is the serial comma used, or not? are abbreviations used, or scorned? the copy-editor is also responsible (along with the main editor) for "ensuring grammatical consistency", giving them a liberal license to rewrite small chunks of the text, and even some not-so-small chunks. many authors have been chagrined by reading the results of those "grammatical edits", since sentences can end up being far different than the author would've written them. but it's too late, it's printed. and, to be realistic about it, sometimes what authors submit is shit. it _needs_ to be copy-edited. if it were to be released as the author wrote it, it would be an embarrassment, to the author and the house, except it is only the house who would be blamed, because everyone familiar with publishing knows it is the _responsibility_ of the house to do copy-editing. they know that that is the job of the copy-editor. and that's not all, either. another person enters the equation as well, the _typesetter_, who also shapes the text into its final form as a book. what does a table of contents look like? how does a bibliography look? how should block-quotes be formatted? how is a data-table formatted? i won't go on and on with all the mods made at the typesetter-level, but the point should be clear about the limits of "authorial intent"... in a nutshell, it's not always that easy to tell. and, to the extent that untrained volunteers like ourselves _can_ tell, readers are equally capable, so we don't have to make the decisions for them, by and large, we can just present what was in the p-book, according to our own "house style", and they can draw conclusions... 
> I'm sure that the proofreaders are doing the best they can - > but in the end are we looking to end all errors - or are we looking to > make sure that the finished text is 100% accurate to the source text? most of what we're talking about here is correcting the _o.c.r._ errors; there's total congruence with fixing those and matching the p-book... but yeah, if p.g. is what i understand it to be, we don't care a whit about "making sure that the finished text is 100% accurate to the source text"... *** to my mind, the very ethos of _republishing_ is that you _should_ fix any errors. your job as publisher is to make it as correct as you can... and there is another consideration here too, one that you need to be aware of, once you have decided you're not a transcriber, and that is the ability to provide an inter-book _consistency_ to your library that might not (in fact, _does_not_) exist across the p-books you digitize. and the only reasonable conclusion to come to, in considering whether you want to activate that ability to enforce consistency, is that you _do_ want to do so, because that's the only strategy that will enable people to use computers to add value to your library. so you will not hesitate to make a change to the text if that's what it takes to give _consistency_... so this echoes what i've been saying about the value of _consistency_. if e-texts in the p.g. library had _consistency_, programmers would be busy crawling all over the thing, like a bunch of ants on a picnic table... -bowerbird p.s. i see that on wednesday, somebody at d.p. voiced similar thoughts... > We seem to have developed a certain veneration for the Author's Intent. > I think it is somewhat over-developed. > > Once a book enters the publishing process,[1] there are certain decisions > that are no longer in the author's control (author puts em-dashes > without spaces, standard for the series is em-dashes with spaces, > global search and replace). 
If the publisher has a standard for the > formatting of something, the manuscript will be adjusted to fit that standard. > > I had a very illuminating lesson in how much a book can change > from edition to edition when I needed to compare two editions > of a book to make sure that an error that had been spotted hadn't > occurred elsewhere in the book. The number of punctuation changes, > em-dashes added or removed, sentences joined and split, etc. > surprised me and made me reconsider how much the author's intent is > really reflected in the final published product in these older books. > > While I think that there are authors that used ellipses in specific and > intentional ways, I think that there were probably many more authors who > were subject to the styles and whims of their editors and publishers. > > Simplifying the ellipses rule could be good but (for the most part) > it isn't a question of reproducing the author's intent, it is about making > a guideline that is easy to understand and apply in the vast majority > of circumstances (since there will always be exceptions). > > Julie > > [Footnote 1: I have worked in publishing in various jobs > in book production for more than 20 years.] > > http://www.pgdp.net/phpBB2/viewtopic.php?p=450173#450173
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080425/58ea6192/attachment.htm
From Bowerbird at aol.com Fri Apr 25 15:30:26 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 25 Apr 2008 18:30:26 EDT Subject: [gutvol-d] have a good weekend Message-ID: i've just posted my newest version of "the plunderer", the latest parallel experiment over at distributed proofreaders, by rfrank...
if you wanted to find my 33 errors, and you haven't yet, might be too late! ;+) > http://z-m-l.com/go/plund/plundc001.html ok, that was a full week. more next week. have a good weekend! -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080425/30ec1dee/attachment-0001.htm From Bowerbird at aol.com Fri Apr 25 15:58:28 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 25 Apr 2008 18:58:28 EDT Subject: [gutvol-d] one more thing Message-ID: oh, one more thing, just because it's too funny... the "perpetual p1" project is now in its 7th iteration of p1 proofing. adding in the p2 and p3 rounds, and any formatting rounds it had, plus the post-processing and the following verification, as well as the whitewasher treatment, and the people following the research, including me, this text has faced a _lot_ of scrutinizing eyeballs... take a look at this line: > "Okay, Frank. Nobody's indispensible. I might do the same if you think "indispensible" is spelled correctly -- like i did (and do) -- you might wanna test it in your spellchecker, and check a p-dictionary. because, just like the 7th-iteration p1 proofer suggested -- it's really "indispensable". right there on page 119, an error in the paper-book: > http://z-m-l.com/go/plans/plansp119.png hey, isn't anybody doing a spellcheck on these books? ;+) anyway, chew on that fat over the weekend... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080425/53d8c48e/attachment.htm From j.hagerson at comcast.net Sat Apr 26 18:27:39 2008 From: j.hagerson at comcast.net (John Hagerson) Date: Sat, 26 Apr 2008 20:27:39 -0500 Subject: [gutvol-d] Orphaned Works Act proposed Message-ID: <00ec01c8a805$e207bd40$1f12fea9@sarek> All of the copyright mavens may already be aware of the following, but I stumbled upon this today and wanted to bring it to everyone's attention. http://arstechnica.com/news.ars/post/20080425-new-orphaned-works-act-would-limit-copyright-liability.html http://www.publicknowledge.org/node/1537 For those who choose not to look at blog entries, please get the text of S. 2913 and H.R. 5889. The House sponsors are Howard Berman (D-CA), Howard Coble (R-NC), John Conyers (D-MI), and Lamar Smith (R-TX). The Senate sponsors are Patrick Leahy (D-VT) and Orrin Hatch (R-UT). It appears that if either of these bills was enacted into law, it would ease the use of copyrighted material when the copyright holder could not be located by reducing the penalties for infringement. (IANAL. Your mileage may vary. Do not use when driving or operating heavy machinery. Contents under pressure! Point cap end away from face when opening.) John Hagerson From hyphen at hyphenologist.co.uk Sat Apr 26 20:59:34 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Sun, 27 Apr 2008 04:59:34 +0100 Subject: [gutvol-d] Orphaned Works Act proposed In-Reply-To: <00ec01c8a805$e207bd40$1f12fea9@sarek> References: <00ec01c8a805$e207bd40$1f12fea9@sarek> Message-ID: <000001c8a81b$1c90e3a0$55b2aae0$@co.uk> John Hagerson wrote >All of the copyright mavens may already be aware of the following, but I >stumbled upon this today and wanted to bring it to everyone's attention.
>http://arstechnica.com/news.ars/post/20080425-new-orphaned-works-act-would-limit-copyright-liability.html >http://www.publicknowledge.org/node/1537 This applies only to the USA; the rest of the world stays on life + 70 or life + 50. USA copyright law is in some respects laxer than the rest of the world. Dave Fawthrop From Bowerbird at aol.com Tue Apr 29 13:17:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 29 Apr 2008 16:17:42 EDT Subject: [gutvol-d] cat got your tongue? Message-ID: nobody has anything to say, eh? well, i guess that's ok... *** greg w. will probably think i'm picking on him, but hey, when you post interesting sci-fi stuff, people will read it. :+) anyway, this e-text: > http://www.gutenberg.org/files/25166/25166-h/25166-h.htm has 7 errors from the original, plus up to 3 more p-book errors... i'd post the scans, but al wouldn't trust them anyway... but you can still get them from d.p. with this link: > http://www.pgdp.net/c/tools/download_images?projectid=projectID4682d9652d4f6 or you can just take my word for it. (if you doubt me, let me know.) ---------------------------------------------------------------- > ing the part. he had known many > ing the part. He had known many > them well. even the emotional con- > them well. Even the emotional con- > simulate; candron knew what "emo- > simulate; Candron knew what "emo- > though it is, will require it least one > though it is, will require at least one > to make good propaganda usage of, > to make good propaganda usage of > both the man in the car and the > both the men in the car and the (it says "man" in the original, but it also says plainly there were 4 men in the car.) > He had wanted to keep those cigarettes.\ > He had wanted to keep those cigarettes. > imbedded in the door was connected > embedded in the door was connected (p-book error. p-dictionary points "imbed" to "embed".)
> destructive-force before it does more > destructive force before it does more > mental endurance of a thousand planetsfull of > mental endurance of a thousand planetsful of (p-book error, if you ask me.) ---------------------------------------------------------------- 7-10 errors in a little over 10,000 words isn't terrible, but it isn't really good either, since many of these errors could have been autodetected and fixed before posting... the bigger point, of course, is that p.g. e-texts _continue_ to be released with too many errors, in spite of the fact that many people will try to tell you that problem is in the past... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080429/04d01cd1/attachment.htm From Bowerbird at aol.com Tue Apr 29 15:40:35 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 29 Apr 2008 18:40:35 EDT Subject: [gutvol-d] i saw a sony reader this weekend Message-ID: i saw a sony reader this weekend at the l.a. times festival of books... it's a nifty little machine, to be sure. (it's a touch smaller than i imagined, and much thinner than a paperback.) the display contrast was fine for me, and i didn't mind the reverse-flash that occurs on every screen-change, or the slowness of the refresh-speed. once they're engrossed in _reading_, most readers love e-book-machines, and there is nothing about this one that'd prevent that from happening. most of the interface was quite clear, and i'd guess that the rest would be easy enough to learn and remember. the absence of search remains huge.
it was entertaining to hear sony reps answer questions about that "other" machine -- the one from amazon -- but i'd guess they know they need to put _some_ kind of web-connection in their next version, or be relegated to the alsoran-didnotfinish category. of course, that just brings up the next hard question, which is how they will manage to build up a catalog as large as the one that amazon already has... but here's hoping they can be healthy competition against the kindle, or else we won't see price-drops until 2012... -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080429/9dd39a9e/attachment.htm From rburkey2005 at earthlink.net Wed Apr 30 07:52:18 2008 From: rburkey2005 at earthlink.net (Ron Burkey) Date: Wed, 30 Apr 2008 09:52:18 -0500 Subject: [gutvol-d] New info on GutenMark software Message-ID: <1209567138.19489.14.camel@software1.heads-up.local> Hi All, I'm sure that some of you are aware of the open-source software called "GutenMark" (http://www.sandroid.org/GutenMark/download.html) which attempts to automatically convert PG plain-text files into HTML or LaTeX. I've recently made fairly significant improvements to the program, and it has somewhat belatedly occurred to me that it might be good to let you know about them. There have been no improvements to the conversion engine as such, but there is now a GUI front-end to the program, and there are both Windows and Linux installers. Furthermore, the download (if you don't need the program's source code) is an all-in-one thing, without the numerous options of "you may want this too" which used to be available. 
The net effect is to make the installation and use of the program extremely simple compared to the past, particularly for someone new to it or who had bad luck trying to make it work in the past. On the downside, there is no longer any attempt to explicitly support Mac OS X or UNIX-type operating systems other than Linux, so if you're in the Mac world you're not in a position to take advantage of any of the improvements. -- Ron Burkey From hart at pglaf.org Wed Apr 30 09:37:39 2008 From: hart at pglaf.org (Michael Hart) Date: Wed, 30 Apr 2008 09:37:39 -0700 (PDT) Subject: [gutvol-d] i saw a sony reader this weekend In-Reply-To: References: Message-ID: I have played with several versions of the Sony, the newer ones are actually better, more options and features, the older ones sucked so bad I had to wonder how they ever got enough nerve to make the newer models. However, sales are SOOO bad that Sony won't tell how many have sold, and perhaps same with others. mh On Tue, 29 Apr 2008, Bowerbird at aol.com wrote: > i saw a sony reader this weekend > at the l.a. times festival of books... > > it's a nifty little machine, to be sure. > (it's a touch smaller than i imagined, > and much thinner than a paperback.) > > the display contrast was fine for me, > and i didn't mind the reverse-flash > that occurs on every screen-change, > or the slowness of the refresh-speed. > > once they're engrossed in _reading_, > most readers love e-book-machines, > and there is nothing about this one > that'd prevent that from happening. > > most of the interface was quite clear, > and i'd guess that the rest would be > easy enough to learn and remember. > > the absence of search remains huge. > > it was entertaining to hear sony reps > answer questions about that "other" > machine -- the one from amazon -- > but i'd guess they know they need to > put _some_ kind of web-connection > in their next version, or be relegated > to the alsoran-didnotfinish category.
> > of course, that just brings up the next > hard question, which is how they will > manage to build up a catalog as large > as the one that amazon already has... > > but here's hoping they can be healthy > competition against the kindle, or else > we won't see price-drops until 2012... > > -bowerbird > > > > ************** > Need a new ride? Check out the largest site for U.S. used car > listings at AOL Autos. > > (http://autos.aol.com/used?NCID=aolcmp00300000002851) > From jared.buck at gmail.com Wed Apr 30 09:39:45 2008 From: jared.buck at gmail.com (Jared Buck) Date: Wed, 30 Apr 2008 10:39:45 -0600 Subject: [gutvol-d] i saw a sony reader this weekend In-Reply-To: References: Message-ID: I don't need an expensive sony reader to portably take ebooks with me. A Palm handheld works just as well :) On Wed, Apr 30, 2008 at 10:37 AM, Michael Hart wrote: > > I have played with several versions of the Sony, > the newer ones are actually better, more options > and features, the older ones sucked so bad I had > to wonder how they ever got enough nerve to make > the newer models. > > However, sales are SOOO bad that Sony won't tell > how mnay have sold, and perhaps same with others. > > mh > > > On Tue, 29 Apr 2008, Bowerbird at aol.com wrote: > > > i saw a sony reader this weekend > > at the l.a. times festival of books... > > > > it's a nifty little machine, to be sure. > > (it's a touch smaller than i imagined, > > and much thinner than a paperback.) > > > > the display contrast was fine for me, > > and i didn't mind the reverse-flash > > that occurs on every screen-change, > > or the slowness of the refresh-speed. > > > > once they're engrossed in _reading_, > > most readers love e-book-machines, > > and there is nothing about this one > > that'd prevent that from happening. > > > > most of the interface was quite clear, > > and i'd guess that the rest would be > > easy enough to learn and remember. > > > > the absence of search remains huge. 
> > > > it was entertaining to hear sony reps > > answer questions about that "other" > > machine -- the one from amazon -- > > but i'd guess they know they need to > > put _some_ kind of web-connection > > in their next version, or be relegated > > to the alsoran-didnotfinish category. > > > > of course, that just brings up the next > > hard question, which is how they will > > manage to build up a catalog as large > > as the one that amazon already has... > > > > but here's hoping they can be healthy > > competition against the kindle, or else > > we won't see price-drops until 2012... > > > > -bowerbird > > > > > > > > ************** > > Need a new ride? Check out the largest site for U.S. used car > > listings at AOL Autos. > > > > (http://autos.aol.com/used?NCID=aolcmp00300000002851) > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080430/57d2d8cf/attachment.htm From walter.van.holst at xs4all.nl Wed Apr 30 10:04:28 2008 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Wed, 30 Apr 2008 19:04:28 +0200 Subject: [gutvol-d] New info on GutenMark software In-Reply-To: <1209567138.19489.14.camel@software1.heads-up.local> References: <1209567138.19489.14.camel@software1.heads-up.local> Message-ID: <4818A69C.202@xs4all.nl> Ron Burkey wrote: > Hi All, > > I'm sure that some of you are aware of the open-source software called > "GutenMark" (http://www.sandroid.org/GutenMark/download.html) which > attempts to automatically convert PG plain-text files into HTML or > LaTeX. I've recently made fairly significant improvements to the > program, and it has somewhat belatedly occurred to me that it might be > good to let you know about them. Looks interesting. 
Are there any plans for a) building a webservice on top of this and b) have TEI-output? Regards, Walter From hart at pglaf.org Wed Apr 30 10:05:55 2008 From: hart at pglaf.org (Michael Hart) Date: Wed, 30 Apr 2008 10:05:55 -0700 (PDT) Subject: [gutvol-d] i saw a sony reader this weekend In-Reply-To: References: Message-ID: On Wed, 30 Apr 2008, Jared Buck wrote: > I don't need an expensive sony reader to portably take ebooks with me. A > Palm handheld works just as well :) > I agree in general, but I do like the battery life of the Sony, and the "page" CAN be easier to read, if you set up right. mh > On Wed, Apr 30, 2008 at 10:37 AM, Michael Hart wrote: > >> >> I have played with several versions of the Sony, >> the newer ones are actually better, more options >> and features, the older ones sucked so bad I had >> to wonder how they ever got enough nerve to make >> the newer models. >> >> However, sales are SOOO bad that Sony won't tell >> how mnay have sold, and perhaps same with others. >> >> mh >> >> >> On Tue, 29 Apr 2008, Bowerbird at aol.com wrote: >> >>> i saw a sony reader this weekend >>> at the l.a. times festival of books... >>> >>> it's a nifty little machine, to be sure. >>> (it's a touch smaller than i imagined, >>> and much thinner than a paperback.) >>> >>> the display contrast was fine for me, >>> and i didn't mind the reverse-flash >>> that occurs on every screen-change, >>> or the slowness of the refresh-speed. >>> >>> once they're engrossed in _reading_, >>> most readers love e-book-machines, >>> and there is nothing about this one >>> that'd prevent that from happening. >>> >>> most of the interface was quite clear, >>> and i'd guess that the rest would be >>> easy enough to learn and remember. >>> >>> the absence of search remains huge. 
>>> >>> it was entertaining to hear sony reps >>> answer questions about that "other" >>> machine -- the one from amazon -- >>> but i'd guess they know they need to >>> put _some_ kind of web-connection >>> in their next version, or be relegated >>> to the alsoran-didnotfinish category. >>> >>> of course, that just brings up the next >>> hard question, which is how they will >>> manage to build up a catalog as large >>> as the one that amazon already has... >>> >>> but here's hoping they can be healthy >>> competition against the kindle, or else >>> we won't see price-drops until 2012... >>> >>> -bowerbird >>> >>> >>> >>> ************** >>> Need a new ride? Check out the largest site for U.S. used car >>> listings at AOL Autos. >>> >>> (http://autos.aol.com/used?NCID=aolcmp00300000002851) >>> >> _______________________________________________ >> gutvol-d mailing list >> gutvol-d at lists.pglaf.org >> http://lists.pglaf.org/listinfo.cgi/gutvol-d >> > From walter.van.holst at xs4all.nl Wed Apr 30 10:17:16 2008 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Wed, 30 Apr 2008 19:17:16 +0200 Subject: [gutvol-d] i saw a sony reader this weekend In-Reply-To: References: Message-ID: <4818A99C.4030807@xs4all.nl> Michael Hart wrote: >> I don't need an expensive sony reader to portably take ebooks with me. A >> Palm handheld works just as well :) >> > > I agree in general, but I do like the battery life of the Sony, > and the "page" CAN be easier to read, if you set up right. Having had an iRex iLiad for a few months now, I can say that apart from its battery life I am as happy as a clam about it. Regards, Walter From Bowerbird at aol.com Wed Apr 30 10:24:43 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Apr 2008 13:24:43 EDT Subject: [gutvol-d] New info on GutenMark software Message-ID: great news ron! :+) -bowerbird ************** Need a new ride? Check out the largest site for U.S. used car listings at AOL Autos. 
(http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080430/0eace44c/attachment.htm From Bowerbird at aol.com Wed Apr 30 10:38:06 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Apr 2008 13:38:06 EDT Subject: [gutvol-d] i saw a sony reader this weekend Message-ID: michael said: > However, sales are SOOO bad that Sony won't tell > how many have sold, and perhaps same with others. when people asked the sony rep about their poor sales relative to the kindle, he responded "we've sold as many as they've sold." well, maybe... but sony hasn't sold _nearly_ as many e-books to those machines. and the e-books are where they're gonna make their money from... besides, sony has dipped its toe into lots of pools and withdrawn, so we can certainly question if they'll stick with this product-line. if the machine didn't have the solid backing of the c.e.o., it would probably have been dropped already, based on its lethargic start. whereas amazon knows it has to stick with the kindle regardless. there's still just one producer of the e-ink screens, maybe two, if the new one can get themselves up to speed quickly enough, so it'll be a very interesting tug-of-war on that particular front. like i said, i hope sony continues. i wouldn't bet money on them. but i do hope they stick it out. still, honestly, i'd buy a kindle first. more books, easier access. and less fear of any root-kit tricks... nonetheless, as a _machine_, the sony reader looked just fine... (although i believe you, michael, that the earlier model sucked.) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080430/86944f41/attachment.htm From rburkey2005 at earthlink.net Wed Apr 30 10:59:29 2008 From: rburkey2005 at earthlink.net (Ron Burkey) Date: Wed, 30 Apr 2008 12:59:29 -0500 Subject: [gutvol-d] New info on GutenMark software In-Reply-To: <4818A69C.202@xs4all.nl> References: <1209567138.19489.14.camel@software1.heads-up.local> <4818A69C.202@xs4all.nl> Message-ID: <1209578369.19489.16.camel@software1.heads-up.local> No. A lot of other people have expressed interest in doing things like that in the past, though I'm not sure if any of them have accomplished it. The program has been around for 5-6 years; it's just the GUI and installers that are new. -- Ron On Wed, 2008-04-30 at 19:04 +0200, Walter van Holst wrote: > Ron Burkey wrote: > > Hi All, > > > > I'm sure that some of you are aware of the open-source software called > > "GutenMark" (http://www.sandroid.org/GutenMark/download.html) which > > attempts to automatically convert PG plain-text files into HTML or > > LaTeX. I've recently made fairly significant improvements to the > > program, and it has somewhat belatedly occurred to me that it might be > > good to let you know about them. > > Looks interesting. Are there any plans for a) building a webservice on > top of this and b) have TEI-output? > > Regards, > > Walter > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From ralf at ark.in-berlin.de Wed Apr 30 11:21:51 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Wed, 30 Apr 2008 20:21:51 +0200 Subject: [gutvol-d] i saw a sony reader this weekend In-Reply-To: <4818A99C.4030807@xs4all.nl> References: <4818A99C.4030807@xs4all.nl> Message-ID: <20080430182151.GB30022@ark.in-berlin.de> > Having had an iRex iLiad for a few months now, I can say that apart from > its battery life I am as happy as a clam about it. 
My reading volume has doubled or trebled with this device, also a fine thing to present sudokus for solving (using touchpad). Guess whether I still buy dead trees. Regards, ralf From ralf at ark.in-berlin.de Wed Apr 30 11:15:09 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Wed, 30 Apr 2008 20:15:09 +0200 Subject: [gutvol-d] New info on GutenMark software In-Reply-To: <4818A69C.202@xs4all.nl> References: <1209567138.19489.14.camel@software1.heads-up.local> <4818A69C.202@xs4all.nl> Message-ID: <20080430181509.GA30022@ark.in-berlin.de> > Looks interesting. Are there any plans for a) building a webservice on > top of this and b) have TEI-output? In principle b) is already there with the 'download TEI' option you get in the PP stage of PGDP. Unfortunately, with the latest fix, one bug (/wrt poetry lines) crept in due to different PHP versions (I think) but everything else should work. If I get to it, I'll soon adapt the Wiki documentation to the new situation. Regards, ralf From joshua at hutchinson.net Wed Apr 30 11:47:34 2008 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Wed, 30 Apr 2008 18:47:34 +0000 (GMT) Subject: [gutvol-d] i saw a sony reader this weekend Message-ID: <1523164629.60891209581254601.JavaMail.mail@webmail06> Not really, Jared. I know...I use one, too (well, an old Windows CE palm device, but similar). The advantage that the Sony Reader and the Kindle have is the display. And once you're talking about long periods of reading, you notice the difference. Josh On Apr 30, 2008, jared.buck at gmail.com wrote: I don't need an expensive sony reader to portably take ebooks with me. 
A Palm handheld works just as well :) From dlowry8 at comcast.net Wed Apr 30 11:47:07 2008 From: dlowry8 at comcast.net (Douglas Lowry) Date: Wed, 30 Apr 2008 14:47:07 -0400 Subject: [gutvol-d] I saw a Sony reader this weekend Message-ID: <001b01c8aaf2$9987cac0$6405a8c0@dlowry> Does anyone know a suitable contact person within Sony (or among the Amazon Kindle people) in case they were to be interested in moving beyond CTRL-F search to a full research-quality search system? There are over 14000 Project Gutenberg texts, all nicely searchable, for free download at www.WCTSE.com. Full details and an illustrated manual are available at www.WordsCloseTogether.com. If anyone cares to look at the manual and try any of the books, I would be very interested in your feedback. Incidentally, indexing books is a lovely way of discovering errors in them! Doug Lowry dlowry at alum.mit.edu / dlowry8 at comcast.net P.S.: I corrected the two spelling errors in the heading. I wonder if that will make this open as a new discussion thread? ;-) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080430/008c043b/attachment.htm From Bowerbird at aol.com Wed Apr 30 12:40:36 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Apr 2008 15:40:36 EDT Subject: [gutvol-d] i saw a sony reader this weekend Message-ID: walter said: > Having had an iRex iLiad for a few months now, > I can say that apart from its battery life > I am as happy as a clam about it. the bigger screen certainly looks appealing to me. and i'd guess that it is worth the extra bulkiness... and the touchscreen is a plus in the age of iphone. at the same time, the lack of any color is a bummer. however, i don't think it's available in the u.s. and the pricetag is a bit of a problem as well, especially considering how weak the dollar is. -bowerbird ************** Need a new ride? Check out the largest site for U.S.
used car listings at AOL Autos. (http://autos.aol.com/used?NCID=aolcmp00300000002851) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080430/95c3f63a/attachment.htm From Bowerbird at aol.com Wed Apr 30 12:45:37 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Apr 2008 15:45:37 EDT Subject: [gutvol-d] I saw a Sony reader this weekend Message-ID: doug said: > If anyone cares to look at the manual > and try any of the books, I would be > very interested in your feedback. i looked at your site a while back... did i send you any feedback then? if not, i'll go back and look at it again. > Incidentally, indexing books is > a lovely way of discovering errors in them! i bet it is... :+) > I corrected the two spelling errors in the heading. i think you probably meant to say "casing errors"... but i assure you they are what the author intended. ;+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080430/380e21a4/attachment.htm From Bowerbird at aol.com Wed Apr 30 12:52:41 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Apr 2008 15:52:41 EDT Subject: [gutvol-d] organizing the public-domain library Message-ID: who is this programmer stealing my thunder? ;+) > http://www.boingboing.net/2008/04/25/voluminous-app-for-o.html oh well, mac-only, and then just 10.5, so i guess he's no competition... :+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080430/88edde60/attachment.htm
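[Editor's note] In the "cat got your tongue?" post above, bowerbird argues that many of the ten listed OCR errors "could have been autodetected and fixed before posting". As a rough illustration of what such autodetection might look like -- a hypothetical sketch, not any tool DP or PG actually ran, with a toy word list standing in for a real spellcheck dictionary -- two cheap checks already flag several of the errors he quotes:

```python
import re

# Toy stand-in for a real spellcheck dictionary (hypothetical; a real
# checker would load a full word list such as aspell's).
KNOWN_WORDS = {
    "a", "at", "emotional", "even", "had", "he", "is", "it", "known",
    "least", "many", "of", "one", "part", "require", "the", "them",
    "though", "thousand", "well", "will",
}

def flag_suspects(text):
    """Return (reason, token) pairs that deserve a human look."""
    suspects = []
    # Check 1: words not in the dictionary (catches "planetsfull",
    # and also broken fragments like the leading "ing").
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() not in KNOWN_WORDS:
            suspects.append(("unknown word", word))
    # Check 2: a lowercase word opening a sentence (catches "part. he had").
    for match in re.finditer(r"[.!?]\s+([a-z]\w*)", text):
        suspects.append(("lowercase after period", match.group(1)))
    return suspects

for reason, token in flag_suspects(
        "ing the part. he had known many of a thousand planetsfull"):
    print(reason, "->", token)
```

On the sample line this flags "ing" and "planetsfull" as unknown words and "he" as a lowercase sentence opener; errors like "it least one" or "man"/"men", which need context to spot, would still require a human, which may be why they survived five proofing rounds.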