From Bowerbird at aol.com Tue Sep 2 10:38:09 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 2 Sep 2008 13:38:09 EDT
Subject: [gutvol-d] the heritage of the desert -- 001
Message-ID:

welcome back from your labor day vacation-time!

i've prepared a new analysis for your consideration.

***

ok, here's "the heritage of the desert", by zane grey.
> http://www.gutenberg.org/files/1262/

it's yet another book that was recently "reposted":
> Corrections have been made in this file
> and it has been updated with new header,
> removed from its old address in etext98,
> and filed under the new directory system.
> An html file has been provided.

very much like the earlier "reposted" e-text i examined,
analysis shows that many errors were corrected, _but_
a big number of easy-to-detect errors were not fixed.

***

this book was first posted in 1999, then updated in 2004,
with that 2004 update correcting a fair number of errors...

still, this 2008 reposting repaired dozens of bureaucratic
errors (e.g., ellipses), plus _three_ substantive corrections:

> head, and offered up a brief prayer, beautiful in its simplicty and
> head, and offered up a brief prayer, beautiful in its simplicity and
> http://z-m-l.com/go/thotd/thotdp017.html

> sagespotted waste for Holderness's ranch. He located it, a black patch
> sage-spotted waste for Holderness's ranch. He located it, a black patch
> http://z-m-l.com/go/thotd/thotdp149.html

> them, a ruddyfaced fellow, walked toward Mescal.
> them, a ruddy-faced fellow, walked toward Mescal.
> http://z-m-l.com/go/thotd/thotdp269.html

***

however...

...many easy-to-detect errors remain in the "reposted" text...

for instance, here's an incorrect word that was not fixed:
> the threshold, followed by tall young men and rosy-checked girls and
> the threshold, followed by tall young men and rosy-cheeked girls and
> http://z-m-l.com/go/thotd/thotdp017.html

and here's a p-book error that should've been corrected:
> "It's lost, surely. I can t even see the tip of the peak that stood so
> "It's lost, surely. I can't even see the tip of the peak that stood so
> http://z-m-l.com/go/thotd/thotdp231.html

furthermore, unlike the second and third changes listed above,
here's an incorrectly-treated p-book hyphenate that wasn't fixed:
> between her and my womenfolk. The old antagonism is gone. Well, well,
> between her and my women-folk. The old antagonism is gone. Well, well,
> http://z-m-l.com/go/thotd/thotdp245.html
there are 4 other cases of "women-folk" in the book, no other "womenfolk".

here are three more incorrectly-treated hyphenates that weren't fixed:

> easy--soho!" cried Naab to his steeds. In the pitchy blackness under the
> easy--so-ho!" cried Naab to his steeds. In the pitchy blackness under the
> http://z-m-l.com/go/thotd/thotdp049.html
there are 2 other cases of "so-ho", but no other cases of "soho".

> the base of the wall. The tracks of the wildhorse band were very fresh
> the base of the wall. The tracks of the wild-horse band were very fresh
> http://z-m-l.com/go/thotd/thotdp094.html

> melons were ripe and luscious. Midsummer was vacationtime for the
> melons were ripe and luscious. Midsummer was vacation-time for the
> http://z-m-l.com/go/thotd/thotdp135.html

***

in evaluating the "ruddy-faced" change, i reviewed other "faced" instances.

sure enough, we find a good many that include the hyphen:
> woolly sheep that added his baa-baa to the din, and a bald-faced burro
> "Now mind you, I'll take a bead on this white-faced spy if you send him
> A red-faced ranger with sandy hair and twinkling eyes appeared.
> father's house with him; and she had remained in the room, white-faced,
> barriers, nor the mesas and domes as black-faced death, nor the
> leading the horses--a slender, clean-faced, dark-haired man--Dene! The
> heap. George and Billy bent over Dave, who sat white-faced against the
> He leaned against a tree in the shadow and watched the gray-faced giant
> them, a ruddy-faced fellow, walked toward Mescal.
> a dark-red blot staining his gray shirt. Flinty-faced Mormons, ruthless

however, we also find two instances that do _not_ include a hyphen:
> a hundred head. The barefaced robber sold them in Lund to a buying
> "No, only it makes this difference: both things will then be barefaced

since the last two are out of sync with the rest of the book, i'd correct them...

***

back in the realm of "bureaucratic changes", there were several unfixed errors:

here is a case of a three-dash sequence (actually, two of them consecutively):
> "--- --- you Mormons! See him! Paul Caldwell! Son of a Bishop! Thought

there are 9 em-dashes with a non-f.a.q.-compliant space following them:
> hope so-- You're quite pale."
> the bear from-- Why Mescal! you're white--you're shaking. There's no
> Piute added his encomium: "Damn--heap big bear-- Jack kill um--big
> you knew. I'm wild-- I'm starved for a sight of you. I love you! Mescal,
> a risk I'm putting on you! But I couldn't help it. Look at me-- Just
> once--please-- Mescal, just one look.... Now go."
> "Father!-- Father!" she panted. "Come--quick--the rustlers!--the
> "Plan?-- Yes. Hide Bolly and Silvermane in the little arbor down in
> man than you. Your work, your religion, your life-- Why! I've no words

then there's this ellipsis which contains spaces between its dots:
> hands--there. . . . Say! Naab, d--n you, her wrists are black an' blue!"

***

in addition, there are 4 paragraphing errors in this reposted e-text:

> Twice the workers saw Silvermane standing on open high ridges,
> http://z-m-l.com/go/thotd/thotdp102.html

> "What luck!" Hare muttered through clinched teeth,
> http://z-m-l.com/go/thotd/thotdp235.html

> "I give up Silver Cup and my stock. Maybe that will con-
> http://z-m-l.com/go/thotd/thotdp244.html

> When he raised his face from the tumbling mass of her
> http://z-m-l.com/go/thotd/thotdp287.html

***

there are also at least 7 cases of _italics_ that were missed:

> _"Ki--yi-i-i!"_ yelled Dave Naab with all the power of his lungs. His head
> http://z-m-l.com/go/thotd/thotdp158.html

> _"Dene!"_ burst from Hare, in a whisper.
> http://z-m-l.com/go/thotd/thotdp248.html

> _"H--l!"_ he shrieked.
> http://z-m-l.com/go/thotd/thotdp273.html

> aid Mescal in every way to some safe hiding-place, and _then_ to seek
> http://z-m-l.com/go/thotd/thotdp276.html

> _"Holderness!"_
> http://z-m-l.com/go/thotd/thotdp281.html

> _"Boy!_Boy!_ You've robbed me." Naab waved his arm from the gaping crowd
> http://z-m-l.com/go/thotd/thotdp291.html

> hand. "August. See, the Bishop's coming. Paul's _father!_ Do you hear?"
> http://z-m-l.com/go/thotd/thotdp292.html

***

finally, there are two full lines -- from the top of page 27 -- that are missing:
> there's any law here. Listen. This desert belongs to the
> Mormons, We found the springs, dug the ditches. No
> http://z-m-l.com/go/thotd/thotdp027.html

***

anyway, that is enough for today... i'll wrap up this analysis tomorrow.

-bowerbird

**************
It's only a deal if it's where you want to go. Find your travel deal here.
(http://information.travel.aol.com/deals?ncid=aoltrv00050000000047)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ajhaines at shaw.ca Tue Sep 2 11:26:24 2008
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Tue, 2 Sep 2008 11:26:24 -0700
Subject: [gutvol-d] the heritage of the desert -- 001
References:
Message-ID: <001601c90d29$682846a0$6401a8c0@ahainesp2400>

Re the spaced em-dashes - there doesn't appear to be anything wrong with
most, if not all, of these.

Em-dashes (or double em-dashes) at the end of sentences, are quite common.
They are used where the speaker or thought is "trailing off", rather than
ending definitely, as with a period or other full stop. As such, they get a
trailing space (or two, if you use two spaces after a sentence-ending full
stop), unless at the end of a quoted passage.

Al

_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d

From Bowerbird at aol.com Tue Sep 2 12:20:12 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 2 Sep 2008 15:20:12 EDT
Subject: [gutvol-d] the heritage of the desert -- 001
Message-ID:

al said:
> Em-dashes (or double em-dashes) at the end of sentences,
> are quite common. They are used where the speaker or thought
> is "trailing off", rather than ending definitely, as with a period
> or other full stop. As such, they get a trailing space

ok, fair enough, i can buy that. :+)

fortunately for me, i put a space _before_and_after_ every em-dash,
precisely so i don't need to make subjective judgment calls like that.
life's too short to try to read the minds of authors dead for 69 years.

i find that it also improves rewrapping results when the em-dash can
_float_, either to the end of one line or the beginning of the next one.

with p.g. e-texts, a mid-sentence em-dash will sometimes join words
that are each already long, forming a super-long mega-word, so that
rewrapping produces an artificially-short line before that mega-word.

and then over at distributed proofreaders, they get the opposite effect,
where their convention of "clothing" the end-line em-dashes produces
an artificially-long top line, with the next line then being way too short.

at any rate, i'll be happy to take those 9 lines out of the "error" column...

***

if anybody questions any of the remaining ~20 errors, please do speak up.

and, of course, if anyone wants to discuss the _larger_ issues here, do so...

-bowerbird
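The rewrapping effect described in the message above can be reproduced with a small sketch. It assumes a naive rewrapper that breaks only at whitespace, modeled here with Python's textwrap and break_on_hyphens=False (textwrap's default mode can also break next to a double hyphen, which is not the behavior under discussion). The function name rewrap and the sample sentence are invented for illustration; the sentence is not from the book.

```python
import textwrap


def rewrap(paragraph, width=40, float_dashes=False):
    """Rewrap a paragraph, breaking lines only at whitespace.

    With float_dashes=True, every "--" gets a space on each side first,
    so the dash becomes its own token and can end or begin a line.
    """
    text = " ".join(paragraph.split())          # undo the old line breaks
    if float_dashes:
        text = " ".join(text.replace("--", " -- ").split())
    return textwrap.wrap(
        text,
        width=width,
        break_long_words=False,   # never cut a word in two
        break_on_hyphens=False,   # whitespace is the only break point
    )


if __name__ == "__main__":
    sample = ("He crossed the sagebrush-covered--unbelievably "
              "desolate--tableland slowly.")
    for label, lines in (("closed-up", rewrap(sample)),
                         ("floating", rewrap(sample, float_dashes=True))):
        print(label)
        for line in lines:
            print("  |" + line)
```

With the closed-up dashes, the first line stops early because the joined "mega-word" will not fit after it; with floating dashes the first line fills out. Whether a dash that ends a sentence should be allowed to float is the separate question raised in the reply that follows.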
From ajhaines at shaw.ca Tue Sep 2 20:22:25 2008
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Tue, 2 Sep 2008 20:22:25 -0700
Subject: [gutvol-d] the heritage of the desert -- 001
References:
Message-ID: <000d01c90d74$4a06b210$6401a8c0@ahainesp2400>

An em-dash acting as a stop-mark should *NEVER* be split from its sentence.
All that does is turn it into an spaced em-dash, with no function.

By the same illogic, you might as well split periods and other stop-marks
from their sentence and let them float, too.

From Bowerbird at aol.com Tue Sep 2 23:55:57 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Sep 2008 02:55:57 EDT
Subject: [gutvol-d] the heritage of the desert -- 001
Message-ID:

al said:
> An em-dash acting as a stop-mark should *NEVER* be split from its sentence.

thanks for your input. i've decided differently, for my cyberlibrary,
and i'm sorry if that offends your sensibilities, but thanks anyway...

> All that does is turn it into an spaced em-dash, with no function.

i submit that its "function" is the same as it ever was,
that is, to indicate a pause of indeterminate length...

but, again, there's no real purpose in _debating_ it.
because first of all, i'm not trying to change your mind,
and second, i quite expect you will never change mine...

so, until we have discussed all the more-important things first,
i don't see making the discussion of this anything of a priority...

-bowerbird

p.s. my spell-checker stumbled on your "an spaced em-dash"
phrase; it wants to change that "an" to an "a"...
From Bowerbird at aol.com Wed Sep 3 13:19:49 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Sep 2008 16:19:49 EDT
Subject: [gutvol-d] the heritage of the desert -- 002
Message-ID:

we're reviewing the reposted "the heritage of the desert".
> http://www.gutenberg.org/files/1262/

analysis showed that many errors were corrected, but
a big number of easy-to-detect errors were not fixed.

***

i won't review all the errors here, but will just note 2 more:

first there was a paragraphing error on page 133:
> it was mescal's pet.
> http://z-m-l.com/go/thotd/thotdp133.html

second, there was a consistency error on page 92:
> VIII.
>
> THE BREAKER OF WILD MUSTANGS
> http://z-m-l.com/go/thotd/thotdp107.html

there were only _2_ blank lines above this chapter header,
compared to _4_ blank lines above every other such header.

***

oh, by the way, if you looked at this book on my site yesterday,
you saw that i was using the o.c.r. text, which had lots of errors.
the thing i liked, though, was that it had the original linebreaks.
the p.g. e-text, on the other hand, had been rewrapped.

so i joined the o.c.r. version with the p.g. version so that _now_
i have clean e-text _and_ the original linebreaks. that's sweet...

since this version of the text can be compared so easily to scans
-- just click through the pages on my site -- i believe that it has
significantly more value than the p.g. version, even if the text is
word-for-word identical. the scans and exactly-matching-text
form synergy together that makes 'em worth more than their pieces.

so at this point, i fail to have any use for the rewrapped p.g. text...

(i'm finally getting the hang of how to reintroduce linebreaks,
although there are still a few little bugs lurking in my routines.
but when i get this process down, it's gonna give me godspeed,
since i'll be able to correct o.c.r. from the o.c.a. with p.g. e-texts,
with o.c.a. scan-sets serving as backup for fairly clean text. yes!)
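The routines mentioned above are not shown in the thread, so the following is only a hedged sketch of the general idea of reintroducing linebreaks: take the word count of each line in the o.c.r. text and re-break the clean text at the same positions. It assumes both texts contain the same words in the same order (o.c.r. errors change letters inside a word, never the number of words); end-of-line hyphenation and split or merged words break that assumption and would need real alignment. The function name rebreak and the sample line split are invented for illustration.

```python
def rebreak(clean_text, ocr_text):
    """Re-impose the line breaks of ocr_text onto clean_text.

    Assumption: both texts have the same number of whitespace-separated
    words, in the same order. Blank lines in the o.c.r. text are kept.
    """
    clean_words = clean_text.split()
    lines = []
    position = 0
    for ocr_line in ocr_text.splitlines():
        count = len(ocr_line.split())
        if count == 0:
            lines.append("")                      # keep paragraph breaks
            continue
        lines.append(" ".join(clean_words[position:position + count]))
        position += count
    if position != len(clean_words):
        # The simple word-count model does not hold for this pair of texts.
        raise ValueError("word counts differ; the texts need real alignment")
    return "\n".join(lines)


if __name__ == "__main__":
    clean = ("head, and offered up a brief prayer, "
             "beautiful in its simplicity and")
    ocr = ("head, and offered up a brief\n"
           "prayer, beautiful in its simplicty and")
    print(rebreak(clean, ocr))
```

Raising an error on a count mismatch is a deliberate choice in this sketch: silently drifting by one word would misplace every later break.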
***

i should also point out that i didn't have the trusty p1 and p2 proofers
from d.p. going over this book and finding all the errors in it for me...

i had to find them myself. and i didn't do any word-for-word proofing.

i'm guessing those proofers might have maybe found even more errors.
and probably jose could have found some more errors even after _that_.

but that's beside the point. i've found enough errors
to make the point, and that's the point.

***

so, what is the point?

well, i'm glad you asked.

but first let me say "first things first"...

***

first things first, the whitewashers are great people. hard workers.
dedicated volunteers. they deserve a wholeheckuva lot of _thanks_.
they give long hours of labor, probably more than is good for them.

if you think this is about them, you need to take a nice long walk...

this is about what _project_gutenberg_ needs to be thinking about.
needs to be talking about. and needs to be _doing_things_ about...

***

i didn't set out to explore this territory.

i picked a "reposted" book, because wikisource john asked me to discuss
the issue of revisions, change-logs, and that general topic, and i figured
a reposted book would be a nice "march to perfection" that would serve
as a good arena for discussion of that whole sphere.

i was as surprised as you when i looked at these "reposted" e-texts and
discovered that they weren't "perfect" (or something close to it), that
they still had some holes that could be detected programmatically.

this honestly surprised me. i thought the whitewashers were god. :+)

but no, seriously, it honestly surprised me. i didn't expect it...

and, to be frank, it's kind of disappointing.

the "new" filing system was introduced way back in 2003. 2003.
that was 5 years ago now. so for 5 years, we've been waiting while
the pre-#10000 e-texts have been dribbled into the "new" filing system.
the wait was slow, but -- like many others, i'm sure -- i thought
it would be "worth it" in the end because the e-texts would finally
be _cleaned_of_errors_.

now that i'm finding that that's not the case, not really, i'm wondering
why we had to wait so long to have a filing system we can depend on.
especially since we _still_ have to do a final cleaning on those e-texts.

***

let me be clear on another thing. after i've done one more of these
"reposted" books, i'll be done with this topic. all done. finished too.

when the "powers that be" over at distributed proofreaders stuck their
head in the sand on the issue of preprocessing, i continued to write
post after post, and provide example after example of its vast benefits.

what a stunning waste of time, since they just _continued_ to ignore it.
so i spent _years_ of my time -- quite literally -- "proving" the obvious.

i'm not going to do the same here. if the "powers that be" here at p.g.
want to stick their head in the sand about this patently obvious problem,
they will continue to do it, no matter how much evidence i muster on it...

so one more "reposted" book, and then i'm done wasting my time on it.

***

ok, so what are the points here? well, like i said, they are _obvious_,
so i'm sure any sentient creatures out there can figure them out, but
since it's my job to say that the emperor has no clothes, here goes:

1. the whitewashers need to up the quality of their clean-up tools.
the errors i've pointed out should help provide direction to that task.

2. everybody needs a quality-control check behind them. everybody.
even the people who are "the last line of defense" for everyone else...
perhaps _especially_ those people. this is a _workflow_ consideration.
for every step in your workflow, you need to do quality-control on it...

3. project gutenberg needs to incorporate its user-base as workers.
my goodness, p.g. has the largest audience of any cyberlibrary, yet
seems incapable of using all those eyeballs to make its bugs shallow.
and yes, there are problems in taking error-reports from people who
might not have the materials to validate those reports beforehand...
but, given the extremely sluggish response generated by bug-reports,
the assumption of the "powers that be" seems to be that error-reports
are -- by their very nature -- flawed. in an age where literally _millions_
of scan-sets are floating around online, it would seem wise to presume
that the error-reports you receive _could_ be grounded in cold hard fact.

4. p.g. needs to _allow_ its user-base to be incorporated as workers...
this is a different point from #3, yet it's just as important, maybe more.
the "powers that be" have set up a fairly cumbersome "update" process.
it will probably have to be reworked so that end-users can participate...
but the whitewashers don't seem to be too willing to relinquish anything.
you don't need to turn the library into a full-fledged wiki, where anyone
can edit anything. but you do have to let them _suggest_ such a change.
and, if there is only one error in a book, you have to be willing to fix it;
in the past, there seems to have been an attitude of waiting until there
are "enough" changes to warrant going through the process of updating.

5. p.g. needs to aggressively find and post scan-sets to back its e-texts.
if you want to receive error-reports that are better-informed, give people
the _means_ to validate their error-reports against an existing scan-set...
have you noticed how easy it has been for me to talk about specific errors
when i can simply give you a u.r.l. where the scan of the page is displayed?

i'll stop there, i guess... i could get much more specific about specifics
of the revision procedure -- i have in the past, if you care to dip into
the archives -- but realistically, until the _mentality_ changes at p.g.,
there's little reason to talk specifics...
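Point 1 asks for better clean-up tools. As one hedged illustration, several of the error types reported earlier in this thread for "the heritage of the desert" (a word that appears both hyphenated and closed up, a three-dash sequence, an ellipsis with spaces between its dots) can be flagged with a few regular expressions. This is a minimal sketch, not the tool used for the analysis above, which is not shown in the thread; the function names and patterns are assumptions made for illustration.

```python
import re

WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")
CHECKS = [
    ("three-dash sequence", re.compile(r"(?<!-)---(?!-)")),
    ("spaced ellipsis", re.compile(r"\.(?: \.){2,}")),
]


def hyphen_inconsistencies(text):
    """Map each closed-up word to its hyphenated twin when both occur."""
    words = {w.lower() for w in WORD.findall(text)}
    return {
        w.replace("-", ""): w
        for w in sorted(words)
        if "-" in w and w.replace("-", "") in words
    }


def punctuation_flags(text):
    """Return (line_number, label) pairs for suspicious punctuation."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS:
            if pattern.search(line):
                hits.append((number, label))
    return hits


if __name__ == "__main__":
    sample = (
        "between her and my womenfolk. The old antagonism is gone.\n"
        "the women-folk came out to meet them.\n"
        '"--- --- you Mormons! See him!\n'
        "hands--there. . . . Say! Naab, d--n you,\n"
    )
    print(hyphen_inconsistencies(sample))
    print(punctuation_flags(sample))
```

The hyphen check only reports a closed-up word when its hyphenated form also occurs somewhere in the text, so it would flag "womenfolk" but not the hyphenless "barefaced"; every hit is a candidate to check against the page scans, not an automatic correction.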
ok, so then maybe i should add a couple points...

6. p.g. needs to have an open dialog about error-reports and revisions.
(of course, if they really wanted that, they could pick up on my lead here
and have a discussion right now, so it's easy to see they have no interest.)

7. p.g. needs to _welcome_ people who can show how to do it better...

so there you go. that's your plan of attack, if you actually care to enact it.
or you can just stick your head back in the sand and ignore those errors...
makes no difference to me, since i can find and fix them in _my_ version...

-bowerbird

p.s. finally, a little reminder-note to give perspective on this whole issue.
back in march, al did an examination on an e-text that i prepped for p.g.
well, jose menedez did most of the exam, but al chimed in at the end with:
> My take on this submission? Given the number of errors/inconsistencies
> found with assorted utilities (I've lost count, but at least 6-8 items, I think,
> so far), I can only assume there are others, possibly findable only with a
> proper proof-reading. If this had been my submission, 6-8 errors is
> 6-8 too many, and I would not have submitted as it stands.

well, al, just so's you know, i just pointed to 20+ errors in a "reposted" book...

From Bowerbird at aol.com Fri Sep 5 01:18:36 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 5 Sep 2008 04:18:36 EDT
Subject: [gutvol-d] frivolous cupid -- 001
Message-ID:

ok, today we're looking at our third (and final) "reposted" e-text...

this one is called "frivolous cupid", and it was reposted by al haines.
we wouldn't want al to feel left out of this little exercise, would we?

this was a _very_ early e-text, #428.
> http://www.gutenberg.org/files/428/

this is its first update since it was posted, on christmas day, in 1995!

though it's fashionable to blast these early e-texts as "error-ridden",
once again we have solid evidence that it's simply not true, not at all.

in fact -- considering how far back 1995 goes in cyberspace time --
the initial producer of this e-text should be _heartily_congratulated_
for doing such a fine job... wasn't perfect, no, but pretty darn close...

***

so first let's look at the changes al made to the "reposted" e-text...

7 corrections were made from the "old" e-text to the updated one:

1. global change of all instances of "- " to "-"

2. global change of all 83 instances of "`" to "'"

3. a paragraphing correction:
> I questioned the porter and found that the two ladies had

4. another paragraphing correction:
> "Then find me somebody else," said Deodonato; "and pray leave me.

5. a paragraph-termination correction:
> In the coldest of voices she said;
> In the coldest of voices she said:

6. a doublequote correction:
> It is a most anxious thing to be an absolute ruler," said Duke
> "It is a most anxious thing to be an absolute ruler," said Duke

7. another doublequote correction:
> "Nay, if you proposed marriage," she shall marry you," said Deodonato.
> "Nay, if you proposed marriage, she shall marry you," said Deodonato.

like i said, this text appears to have been very clean...

***

oh yeah, let me mention one more thing that really bugs me about this...

it appears that _rewrapping_ the text is routinely done on "reposting".
why? it just makes it more difficult to see what revisions were made...

especially on an e-text like this one, which was so very clean originally,
you should be trumpeting the fact that only a few changes were required.
as it is, it's like you're trying very hard to obscure the changes you made...

believe me, people do not need _another_ set of meaningless linebreaks.
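Rewrapping hides revisions only from a line-by-line comparison. As a hedged sketch, a word-level comparison that ignores linebreaks still shows what changed. It uses Python's difflib.SequenceMatcher on the two word lists; the function name word_changes and the sample line wraps are invented for illustration and are not how the comparisons in this thread were actually made.

```python
import difflib


def word_changes(old_text, new_text):
    """List (old_words, new_words) pairs that differ, ignoring linebreaks."""
    old_words = old_text.split()   # split() discards all line structure
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(old_words[i1:i2]),
                            " ".join(new_words[j1:j2])))
    return changes


if __name__ == "__main__":
    old = "In the coldest of voices\nshe said;"
    new = "In the coldest of\nvoices she said:"
    for before, after in word_changes(old, new):
        print(repr(before), "->", repr(after))
```

A change that adds or removes a space inside a word, such as the global "- " to "-" correction, shows up here as a replacement of two words by one.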
*** so three cheers for the errors that were corrected during the "reposting". now let's go on to analyze the "reposted" version... *** once again, a large number of errors remain in the "reposted" version. first, a 3-dash sequence: > all to a charming young lady---but my opinion is that Miss Trix did not next, we have 6 paragraphing errors: > The couple drew near. Mrs. Mortimer sat with a faint smile on her > http://z-m-l.com/go/fricu/fricup031.html > At the first glance, a puzzled look came into the young man's eyes. He > http://z-m-l.com/go/fricu/fricup032.html > Mary and I understood one another. A kiss would be the seal of our > http://z-m-l.com/go/fricu/fricup048.html > To put it briefly and metaphorically, she whistled her dog back to her > http://z-m-l.com/go/fricu/fricup067.html > It was all planned out; nay, the scene in which the truth as to his own > http://z-m-l.com/go/fricu/fricup099.html > None of these things did the philosopher notice, unless it might be > http://z-m-l.com/go/fricu/fricup153.html and then 9 cases of italicized text incorrectly rendered as uppercase: > "'Whatever you do,' it ran, 'don't recognize me. I am WATCHED. > http://z-m-l.com/go/fricu/fricup042.html > thrown the SPY [poor old Dibbs!] off the scent. -- M.' > http://z-m-l.com/go/fricu/fricup047.html > "Off AND on," added Joe candidly. > http://z-m-l.com/go/fricu/fricup068.html > daughter in the Corn -- oh, it's all RIGHT, Lady Queenborough > http://z-m-l.com/go/fricu/fricup122.html > "I -- I was so terribly afraid of seeming to expect YOU." > http://z-m-l.com/go/fricu/fricup143.html > "Only two?" asked the philosopher. "You see, any number of men MIGHT > http://z-m-l.com/go/fricu/fricup156.html > "Suppose, then, that one of these men was -- oh, AWFULLY in love with > http://z-m-l.com/go/fricu/fricup157.html > "But she's not in -- in love with him, you know. She doesn't REALLY care > http://z-m-l.com/go/fricu/fricup157.html > for him -- MUCH. Do you understand?" 
-- 157 > http://z-m-l.com/go/fricu/fricup157.html yes, i know this uppercase convention was typical in the early days... but it was replaced -- for good reason -- and we should expect that when the e-text is "reposted", it will be brought to current standards. after all, isn't that the point of the exercise? plus 15 other italics -- french terms (and one newspaper name) -- that are _missing_: > http://z-m-l.com/go/fricu/fricup043.html > http://z-m-l.com/go/fricu/fricup043.html > http://z-m-l.com/go/fricu/fricup046.html > http://z-m-l.com/go/fricu/fricup046.html > http://z-m-l.com/go/fricu/fricup047.html > http://z-m-l.com/go/fricu/fricup048.html > http://z-m-l.com/go/fricu/fricup072.html > http://z-m-l.com/go/fricu/fricup090.html > http://z-m-l.com/go/fricu/fricup116.html > http://z-m-l.com/go/fricu/fricup122.html > http://z-m-l.com/go/fricu/fricup127.html > http://z-m-l.com/go/fricu/fricup139.html > http://z-m-l.com/go/fricu/fricup147.html > http://z-m-l.com/go/fricu/fricup177.html > http://z-m-l.com/go/fricu/fricup178.html so, al... that's about 30 errors still remaining in this "reposted" e-text... a lot... might be more. i didn't run all my tests. but this is enough to make the point. *** it appears that a standard part of the "reposting" process involves the creation of an .html version. i'm all in favor of that. however... the main appeal of an .html version -- to most people, i expect -- is that the text styling will be done. so when it is skipped, like here, that seems to be a breach of the expectations that were implied by the creation of an .html version. another reason commonly given for the .html "requirement" -- and it _is_ often _required_ at d.p. -- is to include illustrations. but as this reposting clearly demonstrates, the illustrations in this book were _not_ included in the .html version. my guess is that most people expect that things like text styling and illustrations _are_ being done, when applicable, during a "reposting". 
for sure, i expected that they would be. and i'm sure that sometimes they actually _are_... but, as we see here, that is not always the case... being realistic, i could accept that these things aren't done routinely. but that's what we've been waiting for so long -- about 5 years now -- so if it's _not_ being done as a matter of routine, you should _tell_us_, so we make our own plans to do these vitally important improvements. or else streamline the procedure for us to work through whitewashers, so it doesn't take another 13 years to get the errors in this e-text fixed. -bowerbird ************** It's only a deal if it's where you want to go. Find your travel deal here. (http://information.travel.aol.com/deals?ncid=aoltrv00050000000047) -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Sep 5 10:01:56 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 5 Sep 2008 13:01:56 EDT Subject: [gutvol-d] frivolous cupid -- 002 Message-ID: i make errors too. i said: > To put it briefly and metaphorically, she whistled her dog back to her > http://z-m-l.com/go/fricu/fricup067.html but it should have been: > To put it briefly and metaphorically, she whistled her dog back to her > http://z-m-l.com/go/fricu/fricup100.html everybody needs a quality-control step behind them... -bowerbird p.s. of course, once you found out the u.r.l. i gave you wasn't correct, you do know that you could have found the passage using the .zml file -- http://z-m-l.com/go/fricu/fricu.zml -- to see what page it was on, and then generated the correct u.r.l. for that page all by yourself, right? i mean, that's a big benefit of this transparent and consistent workflow. ************** It's only a deal if it's where you want to go. Find your travel deal here. (http://information.travel.aol.com/deals?ncid=aoltrv00050000000047) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Bowerbird at aol.com Fri Sep 5 13:19:51 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 5 Sep 2008 16:19:51 EDT Subject: [gutvol-d] same as it ever was Message-ID: well, both presidential candidates promise change... that's good. we need it. meanwhile, though, back on the internets, things is the same as they've always been... you got adobe, for example, "discontinuing development" of a cool product originally produced by an adobe _competitor_ which adobe "just happened" to have purchased a while back. "flashpaper" this time, from the-company-formerly-known-as macromedia. flash itself adobe has adopted as their platform -- even though, as digital editions shows, their programmers... well... let's just say they haven't quite got the hang of it yet -- but "flashpaper" was much too close to the .pdf bread-winner, so everybody should have known that its future was doomed... > http://uk.techcrunch.com/2008/09/04/startups-in-chaos-as-adobes-flashpaper-discontinues/ buy your competition and bury their version of your product -- it's been the main way adobe has stayed on top all these years... unlike an average monopoly, they don't take ubiquity for granted. and, in other breaking news, sharks eat fish. of any size, while the other fish just eat smaller fish. *** meanwhile, over on teleblawg, "the idiot" (as he is affectionately known to some of us) is hyping a press-release on a sub-$100 e-reader-machine. eventually the price will drop to a cool $50, as he will be sure to tell you. and, in other breaking news, the swallows will leave capistrano this fall... *** david byrne might be getting old -- yeah, got gray hair now, and plays more with powerpoint instead of a real musical instrument -- but around internet tubes, things keep goin' round'n'round... have a nice weekend, folks... :+) -bowerbird ************** Psssst...Have you heard the news? There's a new fashion blog, plus the latest fall trends and hair styles at StyleList.com. 
(http://www.stylelist.com/trends?ncid=aolsty00050000000014) -------------- next part -------------- An HTML attachment was scrubbed... URL: From hyphen at hyphenologist.co.uk Fri Sep 5 23:29:18 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Sat, 6 Sep 2008 07:29:18 +0100 Subject: [gutvol-d] same as it ever was In-Reply-To: References: Message-ID: <001301c90fe9$e7c8c160$b75a4420$@co.uk> Bowerbird at aol.com wrote >well, both presidential candidates promise change... But we have never had a president, how can we have candidates? Dave Fawthrop (from UK) -------------- next part -------------- An HTML attachment was scrubbed... URL: From grythumn at gmail.com Fri Sep 5 23:41:26 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Sat, 6 Sep 2008 02:41:26 -0400 Subject: [gutvol-d] same as it ever was In-Reply-To: <001301c90fe9$e7c8c160$b75a4420$@co.uk> References: <001301c90fe9$e7c8c160$b75a4420$@co.uk> Message-ID: <15cfa2a50809052341u1471c590s19ce5da0562a6b79@mail.gmail.com> On Sat, Sep 6, 2008 at 2:29 AM, Dave Fawthrop wrote: > But we have never had a president, how can we have candidates? Candied dates are easy.. boil some sugar and cream of tartar until you hit the soft thread stage. Remove from heat, and dip the dates in on a skewer. Stuffing them with peanut butter beforehand adds some contrast. -R C From Bowerbird at aol.com Sun Sep 7 12:48:15 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 7 Sep 2008 15:48:15 EDT Subject: [gutvol-d] september 21 is two weeks from today Message-ID: on 9/21, two weeks from today, i will release "banana cream" to the public. (for some reason, i was confused and said it was september 23rd earlier...) again with the proviso that the flak-givers here don't act up... -bowerbird ************** Psssst...Have you heard the news? There's a new fashion blog, plus the latest fall trends and hair styles at StyleList.com. 
(http://www.stylelist.com/trends?ncid=aolsty00050000000014) -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Sep 8 13:55:22 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 8 Sep 2008 16:55:22 EDT Subject: [gutvol-d] frivolous cupid -- 003 Message-ID: "frivolous cupid" is up, with clean text, and linebreaks that match the p-book. > http://z-m-l.com/go/fricu/fricup123.html i've stressed that this linebreak matching makes it easier for the end-user to verify the accuracy of the text against the scan, but that's just one of the resultant benefits. another is the ability to do "digital reprints" -- a la those created by jose menendez -- where the e-text creates an exact duplicate of the scan-set, but gives the user _control_ over the font used, and creates clean output. the ability to search these "digital reprints", and copy out the text, plus their small size, all while duplicating the look of a scan-set, makes them clearly superior to the scan-set. as you can see here: > http://digital.library.upenn.edu/webbin/bparchive?year=2006& post=2006-02-08,2 i said this: > any time you dump to hard-copy, > you have a book that is "frozen", > so all other things being equal, > you might as well have _your_ "frozen" copy > look like other preexisting "frozen" copies, > i.e., one that matches the p-book pagination. so yes, i've been saying the same thing for 2.5 years now... and i'm more convinced than ever that it's the right course. also in this regard, consider the offerings here: > http://publicdomainreprints.org/ -bowerbird p.s. you can find jose's "digital reprints" at: > http://www.ibiblio.org/ebooks > http://www.ibiblio.org/ebooks/Mabie/ > http://www.ibiblio.org/ebooks/Einstein/Einstein_Relativity.pdf > http://www.ibiblio.org/ebooks/Geronimo/ ************** Psssst...Have you heard the news? There's a new fashion blog, plus the latest fall trends and hair styles at StyleList.com. 
(http://www.stylelist.com/trends?ncid=aolsty00050000000014) -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Mon Sep 8 16:09:02 2008 From: dakretz at gmail.com (don kretz) Date: Mon, 8 Sep 2008 16:09:02 -0700 Subject: [gutvol-d] TwistEd Message-ID: <627d59b80809081609w83a9d3fr3927fab332e84a68@mail.gmail.com> I've posted a newer alpha test version of TwistEd. There is also a file, bird.regex, that contains regular expressions for most of the rules in that recent series of postings. It's still quite rough, but hopefully useful. First, choose a directory containing image files and text files with matching filenames. Then, you should notice another file browser prompt where you can find and load the regex file (Look for the button with the "..." caption.) It will doubtless prove useful to be familiar with Regular Expressions. If you care to build your own regex file, bird.regex should give you enough information about the format. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Tue Sep 9 14:51:16 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 9 Sep 2008 17:51:16 EDT Subject: [gutvol-d] e-ink gets real -- big thin light robust easy Message-ID: after many years of hype and spin, and then having amazon and sony pave the way with the paying public, it appears that e-ink is getting real... > http://www.plasticlogic.com/ > http://www.youtube.com/watch?v=v226DYqlbHQ of course, the cost will still be too high, especially considering it's not full-color, but at least everything else is right now... -bowerbird ************** Psssst...Have you heard the news? There's a new fashion blog, plus the latest fall trends and hair styles at StyleList.com. (http://www.stylelist.com/trends?ncid=aolsty00050000000014) -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From 1001 at atlanticbb.net Thu Sep 11 12:51:00 2008
From: 1001 at atlanticbb.net (1001 at atlanticbb.net)
Date: Thu, 11 Sep 2008 15:51:00 -0400
Subject: [gutvol-d] Fw: Copyright question
Message-ID: <004201c91447$bb6baf80$650fa8c0@atlanticbb.net>

is it rejected this time?
nwolcott2 at post.harvard.edu

----- Original Message -----
From: 1001 at atlanticbb.net
To: gutvol-d at lists.pglaf.org
Sent: Monday, September 08, 2008 12:32 AM
Subject: Copyright question

Anyone know the copyright status of the pdf file of a public domain book (US)? What about watermarks? (a) can be removed? (b) are irrelevant? (c) include TM acknowledgement? How about the Digital copyright act? How about a copy of a web page describing the pdf, if no TM or copyright notice is on the web page?
nwolcott2 at post.harvard.edu

From creeva at gmail.com Thu Sep 11 13:20:53 2008
From: creeva at gmail.com (creeva at gmail.com)
Date: Thu, 11 Sep 2008 16:20:53 -0400
Subject: [gutvol-d] Fw: Copyright question
In-Reply-To: <004201c91447$bb6baf80$650fa8c0@atlanticbb.net>
References: <004201c91447$bb6baf80$650fa8c0@atlanticbb.net>
Message-ID: <2510ddab0809111320x26624572y473c211ba9a59f82@mail.gmail.com>

the text of the pdf would be in the clear - the formatting might be. The web page describing it would be under copyright unless there is a notice granting you further rights

On 9/11/08, 1001 at atlanticbb.net <1001 at atlanticbb.net> wrote:
> is it rejected this time?
> nwolcott2 at post.harvard.edu
> ----- Original Message -----
> From: 1001 at atlanticbb.net
> To: gutvol-d at lists.pglaf.org
> Sent: Monday, September 08, 2008 12:32 AM
> Subject: Copyright question
>
>
> Anyone know the copyright status of the pdf file of a public domain book
> (US)? What about watermarks? (a) can be removed? (b) are irrelevant? (c)
> include TM acknowledgement? How about the Digital copyright act?
How about a > copy of a web page describing the pdf, if no TM or copyright notice is on > the web page? > nwolcott2 at post.harvard.edu From grythumn at gmail.com Thu Sep 11 13:27:25 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Thu, 11 Sep 2008 16:27:25 -0400 Subject: [gutvol-d] Fw: Copyright question In-Reply-To: <004201c91447$bb6baf80$650fa8c0@atlanticbb.net> References: <004201c91447$bb6baf80$650fa8c0@atlanticbb.net> Message-ID: <15cfa2a50809111327y5d717ee4rdfb4205124750bf3@mail.gmail.com> Scans of public domain works are still public domain in the US. Modern trademarked or copyrighted material must be omitted before redistribution unless you have authorization to reproduce the modern material. There may be some wrinkles re: collection copyright if a single container has multiple, separate works, as well as possible contractual issues... if you need a definitive answer, talk to a lawyer. For a simple case, say a google or microsoft PDF, I strip all watermarks and modern preambles before processing for DP. They get credited as source, later. As a courtesy, DP does not republish page images (after a work is complete) from online archives without permission, but this is not required by law. R C On Thu, Sep 11, 2008 at 3:51 PM, <1001 at atlanticbb.net> wrote: > is it rejected this time? > nwolcott2 at post.harvard.edu > ----- Original Message ----- > From: 1001 at atlanticbb.net > To: gutvol-d at lists.pglaf.org > Sent: Monday, September 08, 2008 12:32 AM > Subject: Copyright question > Anyone know the copyright status of the pdf file of a public domain book > (US)? What about watermarks? (a) can be removed? (b) are irrelevant? (c) > include TM acknowldgement? How about the Digital copyright act? How about a > copy of a web page describing the pdf, if no TM or copyright notice is on > the web page? 
> nwolcott2 at post.harvard.edu > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > From Bowerbird at aol.com Thu Sep 11 14:18:59 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 11 Sep 2008 17:18:59 EDT Subject: [gutvol-d] electronic-books have finally "arrived" Message-ID: electronic-books have finally "arrived" from major publishers. consider this press-release from amazon.com, announcing biographies about the _wives_ of the presidential candidates: > http://phx.corporate-ir.net/phoenix.zhtml?c=176060&p=irol-newsArticle& ID=1195820 it says: > The Obama biography is available exclusively on Kindle starting today; > the physical book is scheduled to be available later this year. > The McCain biography is scheduled to be available on Kindle this Monday, > September 15; the book will only go to print if John McCain wins the election. look at that again: > the book will only go to print if John McCain wins the election. is that funny, or what? they _know_ that there will be little demand for this book _unless_ the woman will end up living in the white house, so they ain't even gonna bother to _print_ the book unless that (nightmare) happens. in the meantime, though, they'll be happy to take your money for a very-low-cost-to-produce _electronic_version_of_the_book_... that's right, the publishing industry has finally realized how it can _save_ some big bucks via e-books, so they'll now _embrace_ them. just like the computer-makers realized they could _save_money_ by turning their manuals into e-books, instead of paying to print them. you notice _they_ didn't have any resistance to e-books! not one bit! sometimes the brazen greed of the capitalists curdles my stomach... -bowerbird ************** Psssst...Have you heard the news? There's a new fashion blog, plus the latest fall trends and hair styles at StyleList.com. 
(http://www.stylelist.com/trends?ncid=aolsty00050000000014) -------------- next part -------------- An HTML attachment was scrubbed... URL: From gbuchana at teksavvy.com Thu Sep 11 15:00:21 2008 From: gbuchana at teksavvy.com (Gardner Buchanan) Date: Thu, 11 Sep 2008 18:00:21 -0400 Subject: [gutvol-d] What ever heppened to Ibn Battuta? In-Reply-To: <004201c91447$bb6baf80$650fa8c0@atlanticbb.net> References: <004201c91447$bb6baf80$650fa8c0@atlanticbb.net> Message-ID: <48C994F5.1070506@teksavvy.com> Some time ago, a long time ago, there was a bunch of discussion of "The Travels of Ibn Battuta". There was a PG-DP, then PG-DP-Europe project for it. That was a while ago now and I can't seem to find the completed book. Anyone have any ideas where it went? ============================================================ Gardner Buchanan Ottawa, ON FreeBSD: Where you want to go. Today. From jayvdb at gmail.com Thu Sep 11 19:07:27 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Fri, 12 Sep 2008 12:07:27 +1000 Subject: [gutvol-d] Fw: Copyright question In-Reply-To: <15cfa2a50809111327y5d717ee4rdfb4205124750bf3@mail.gmail.com> References: <004201c91447$bb6baf80$650fa8c0@atlanticbb.net> <15cfa2a50809111327y5d717ee4rdfb4205124750bf3@mail.gmail.com> Message-ID: On Fri, Sep 12, 2008 at 6:27 AM, Robert Cicconetti wrote: > Scans of public domain works are still public domain in the US. Modern > trademarked or copyrighted material must be omitted before > redistribution unless you have authorization to reproduce the modern > material. There may be some wrinkles re: collection copyright if a > single container has multiple, separate works, as well as possible > contractual issues... if you need a definitive answer, talk to a > lawyer. On Wikimedia Commons, the file server for Wikisource, there was a related discussion in February. 
http://commons.wikimedia.org/wiki/Commons_talk:Licensing/Archive_10#Microsoft_restrictions_on_archive.org_DJVU_files

> For a simple case, say a google or microsoft PDF, I strip all
> watermarks and modern preambles before processing for DP. They get
> credited as source, later. As a courtesy, DP does not republish page
> images (after a work is complete) from online archives without
> permission, but this is not required by law.

On Wikisource, the pagescans are published forever. On the English language Wikisource project, we don't (yet?) regularly clean up the pagescans before pushing them into the system. We probably should - what software are you using to pull out the watermarks?

The French Wikisource project puts a lot of work into cleaning up their pagescans before publishing. Here is an example of their work:

http://fr.wikisource.org/wiki/Livre:Proust_-_Du_c%C3%B4t%C3%A9_de_chez_Swann.djvu

--
John

From traverso at posso.dm.unipi.it Thu Sep 11 23:18:06 2008
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Fri, 12 Sep 2008 08:18:06 +0200 (CEST)
Subject: [gutvol-d] What ever heppened to Ibn Battuta?
In-Reply-To: <48C994F5.1070506@teksavvy.com> (message from Gardner Buchanan on Thu, 11 Sep 2008 18:00:21 -0400)
References: <004201c91447$bb6baf80$650fa8c0@atlanticbb.net> <48C994F5.1070506@teksavvy.com>
Message-ID: <20080912061806.91F7593DBC@posso.dm.unipi.it>

>>>>> "Gardner" == Gardner Buchanan writes:

Gardner> Some time ago, a long time ago, there was a bunch of
Gardner> discussion of "The Travels of Ibn Battuta". There was a
Gardner> PG-DP, then PG-DP-Europe project for it. That was a while
Gardner> ago now and I can't seem to find the completed
Gardner> book. Anyone have any ideas where it went?

The book has completed the rounds at dp-eu, and is currently checked out for post-processing by JulietS. The many snippets in Arabic have not been done; I believe that they were supposed to be done offline.
Carlo From ebooks at ibiblio.org Fri Sep 12 05:09:45 2008 From: ebooks at ibiblio.org (Jose Menendez) Date: Fri, 12 Sep 2008 08:09:45 -0400 Subject: [gutvol-d] PG's "Peter Pan in Kensington Gardens" Message-ID: <48CA5C09.3010702@ibiblio.org> I see from PG's "posted" mailing list that "Peter Pan in Kensington Gardens" by J. M. Barrie was reposted yesterday. As I pointed out to Michael Hart on the Book People mailing list nearly three years ago, that ebook had some very serious problems. (And the reposted edition still has them.) For one thing, about one-third of it is missing. PG's ebook only has these four chapters: Peter Pan The Thrush's Nest The Little House Lock-out Time But the original book has six chapters: The Grand Tour of the Gardens Peter Pan The Thrush's Nest Lock-Out Time The Little House Peter's Goat As you can see, PG's ebook is missing the first and last chapters. And if that isn't bad enough, the last two chapters in PG's ebook, "The Little House" and "Lock-Out Time" are in reverse order. You needn't take my word for it. The Internet Archive has three scanned old editions online. These two were published in the U.S. by Scribner's: http://www.archive.org/details.php?identifier=peterpaninkensin00barr http://www.archive.org/details.php?identifier=peterpaninkensin00barruoft And this one was published in the U.K. by Hodder & Stoughton: http://www.archive.org/details.php?identifier=peterpaninkensin00barr2 If you check them, you'll see that all three editions have all six chapters in the order I listed. Jose Menendez P.S. I'd retired from making ebooks some time ago, but when I saw that PG's reposted ebook is still incomplete, I made illustrated HTML and PDF versions based on this Scribner's edition at the Internet Archive: http://www.archive.org/details.php?identifier=peterpaninkensin00barr (I would have preferred using the Hodder & Stoughton edition because it was published with 50 color plates vs. 
only 16 in the Scribner's, but several of the plates were missing from the H & S copy the IA scanned.) If PG would like to finally have the complete text of "Peter Pan in Kensington Gardens," feel free to copy the text from my versions: http://www.ibiblio.org/ebooks/Barrie/ As for the illustrations, you're welcome to copy those as well, but I didn't save them at a very high quality in order to keep the file sizes reasonable. If you think my copies are too grainy, you could extract the illustrations from IA's PDF. From Bowerbird at aol.com Fri Sep 12 11:02:48 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 12 Sep 2008 14:02:48 EDT Subject: [gutvol-d] PG's "Peter Pan in Kensington Gardens" Message-ID: wow. for anyone who'd like to track this down, here was jose's first post: > http://digital.library.upenn.edu/webbin/bparchive?year=2005& post=2005-10-20,10 and here was a follow-up: > http://digital.library.upenn.edu/webbin/bparchive?year=2005& post=2005-10-27,5 you'd think that, upon being told that they had missed _one-third_ of a book, they'd go back immediately and fix that huge problem, wouldn't you? i would. yet this book managed to sit at p.g., undisturbed, for nearly 3 years after that. that's bad enough. but even worse, then it was "reposted", _without_ being fixed! amazing. and embarrassing. somehow, this is starting to remind me of dusty books on a lonely library bookshelf. ones that nobody cares about, because nobody ever uses them... it makes me sad... peter pan is a book about being forever young. but p.g. is looking long in the tooth. -bowerbird p.s. jose's current record: 1 signal post, 1 noise post. thanks for the signal, jose! ************** Psssst...Have you heard the news? There's a new fashion blog, plus the latest fall trends and hair styles at StyleList.com. (http://www.stylelist.com/trends?ncid=aolsty00050000000014) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Bowerbird at aol.com Fri Sep 12 13:37:46 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 12 Sep 2008 16:37:46 EDT Subject: [gutvol-d] PG's "Peter Pan in Kensington Gardens" Message-ID: ok, so we've got two digitizations, and a clean-up by jose menendez, for "peter pan". meaning i can execute my comparison method and see how well it does on this book. first pass says that less than 10% of the ~3,000 non-blank lines from this book _differ_ between the two digitizations... that's just a tease for the week-end. i'll report the full results on monday... have a nice weekend... -bowerbird ************** Psssst...Have you heard the news? There's a new fashion blog, plus the latest fall trends and hair styles at StyleList.com. (http://www.stylelist.com/trends?ncid=aolsty00050000000014) -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Tue Sep 16 04:30:01 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 16 Sep 2008 13:30:01 +0200 Subject: [gutvol-d] [Fwd: research, unicity, compression ratios, ebooks] Message-ID: <48CF98B9.6040709@perathoner.de> > -------- Original Message -------- > Subject: research, unicity, compression ratios, ebooks > Date: Tue, 16 Sep 2008 04:54:35 +0000 > From: D. Osborn > To: > > Hello, > Can you give me a short list of some of the books that have the lowest > compression ratios? Also highest would be nice as well. While languages > differ in compressability, so do the works of authors. My hypothesis is > that the less a work can be compressed, the better the writing. Only you > have the access to find the most exemplary instances of this. If you > can, thanks! > Dale Osborn Anybody wants to answer? 
Results of this query attached: gutenberg=> select books.pk, (f1.filesize * 1.0) / f2.filesize as ratio, a.text as title from books, attributes as a, files as f1, files as f2 where a.fk_books = books.pk and a.fk_attriblist = 245 and f1.fk_books = books.pk and f1.fk_filetypes = 'txt' and f1.fk_compressions = 'zip' and f2.fk_books = books.pk and f2.fk_filetypes = 'txt' and f2.fk_compressions = 'none' and f1.fk_encodings = 'us-ascii' and f2.fk_encodings = 'us-ascii' and f1.diskstatus = 0 and f2.diskstatus = 0 and f1.obsoleted = 0 and f2.obsoleted = 0 order by ratio limit 100; -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ratios.txt URL: From jayvdb at gmail.com Tue Sep 16 06:16:12 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Tue, 16 Sep 2008 23:16:12 +1000 Subject: [gutvol-d] [Fwd: research, unicity, compression ratios, ebooks] In-Reply-To: <48CF98B9.6040709@perathoner.de> References: <48CF98B9.6040709@perathoner.de> Message-ID: On Tue, Sep 16, 2008 at 9:30 PM, Marcello Perathoner wrote: >> -------- Original Message -------- >> Subject: research, unicity, compression ratios, ebooks >> Date: Tue, 16 Sep 2008 04:54:35 +0000 >> From: D. Osborn >> To: >> >> Hello, >> Can you give me a short list of some of the books that have the lowest >> compression ratios? Also highest would be nice as well. While languages >> differ in compressability, so do the works of authors. My hypothesis is that >> the less a work can be compressed, the better the writing. Only you have the >> access to find the most exemplary instances of this. If you can, thanks! >> Dale Osborn > > Anybody wants to answer? I have nothing to say about the measurement of writing. I am sure people have tried to determine what "better writing" consists of, or worse yet quantify it, but I'll have no bar of it. Encoding is achieved using an algorithm, and each algorithm is different. 
From an information theory perspective, a piece of communication (written or otherwise) can be seen as a stream of informative bits which can be measured, giving a length. For writing, the number of characters and spaces is the length. If this sequence can be losslessly encoded to a smaller length, the compressed length is a more accurate estimate of the amount of "information" than the uncompressed length. That is, when I can decode the compressed sequence to return the original, the uncompressed sequence had superfluous noise in it.

Y__ kn_w h_w th_ v_w_ls d_nt r__lly m_tt_r?

A_d n_i_h_r d_ t_e c_n_o_a_t_...

Did you have problems with "consonants"? You can do the above with nearly any English sentence, but some are more difficult, especially when they contain big words with particular letters missing. But it isn't the size of the words that matters. For example, I am sure most will be able to pick this word if you think about it.

s____c___f____l_____e____l_d_____s

English has a low signal-to-noise ratio, that is, low entropy per character:

http://www.everything2.net/e2node/entropy%2520of%2520English

So, assuming that the compression algorithm used is _ideal_, the hypothesis is (stated another way) that the entropy of pieces of English text can be used to infer value, aesthetics, or `something'. This is the field of computational linguistics, and I haven't heard that they are approaching that objective.

http://en.wikipedia.org/wiki/Computational_linguistics

It is however easy to computationally assess a work to be "similar" to others.
http://www.everything2.net/e2node/Using%2520gzip%2520to%2520do%2520computational%2520linguistics

> Results of this query attached:
>
> gutenberg=> select books.pk, (f1.filesize * 1.0) / f2.filesize as ratio,
> a.text as title from books, attributes as a, files as f1, files as f2 where
> a.fk_books = books.pk and a.fk_attriblist = 245 and f1.fk_books = books.pk
> and f1.fk_filetypes = 'txt' and f1.fk_compressions = 'zip' and f2.fk_books =
> books.pk and f2.fk_filetypes = 'txt' and f2.fk_compressions = 'none' and
> f1.fk_encodings = 'us-ascii' and f2.fk_encodings = 'us-ascii' and
> f1.diskstatus = 0 and f2.diskstatus = 0 and f1.obsoleted = 0 and
> f2.obsoleted = 0 order by ratio limit 100;

I assume this is looking at the size of the 'zip' file. A zip file is a container format rather than an encoding method. Tools that create zip files typically employ the DEFLATE algorithm, which combines LZ77 matching with Huffman coding, itself an umbrella for a few different algorithms differentiated primarily on how the mathematics are used to best advantage for the time vs space requirements. (The "optimal" encoding is when the output consumes least space, or its information entropy is highest, but that could come at the cost of taking a very long time to compute.)

http://en.wikipedia.org/wiki/Huffman_coding

Zip tools usually have a few knobs that can be used to tweak whether time or space is important, and those choices play a large part in how well the encoding goes. I am guessing that Unicode causes a major problem for the default Zip settings, as the defaults are designed to work well for typical ASCII usage, or at least they used to be.

Only if all Gutenberg texts were constructed using the same algorithmic choices would the results of a database query have any meaning. If the software used to create the zip files has changed, then it is probable that this also makes the result less meaningful.
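To make the point about "knobs" concrete, here is a minimal sketch in Python. It assumes that zlib's DEFLATE implementation is a fair stand-in for whatever tool built the Gutenberg zip files, which is not confirmed anywhere in this thread, and it ignores the zip container overhead. The function name and the two sample strings are made up for illustration.

```python
import random
import zlib


def deflate_ratio(text: str, level: int = 6) -> float:
    """Return compressed size / original size for ASCII text.

    `level` is zlib's speed-versus-size knob (1 = fastest, 9 = smallest).
    """
    raw = text.encode("ascii", errors="replace")
    if not raw:
        raise ValueError("cannot compute a ratio for empty text")
    return len(zlib.compress(raw, level)) / len(raw)


# Two made-up samples, not Gutenberg texts.
repetitive = "and it came to pass that " * 200
rng = random.Random(42)
noisy = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(5000))

for name, sample in (("repetitive", repetitive), ("noisy", noisy)):
    for level in (1, 6, 9):
        print(f"{name:10s} level={level} ratio={deflate_ratio(sample, level):.3f}")
```

Running every text through the same function at the same level is one way to get ratios that are at least comparable with each other, which the stored zip sizes may not be.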
Also, the Gutenberg etext format introduces its own peculiarities, especially a header - a big header - that will likely influence the construction of the encoding tables used to encode the file. (The exception is when the encoding is performed in chunks, so that a table built for the first 1 KB doesn't carry over to determine how the second 1 KB is encoded. Otherwise the encoding routine ends up predisposed to prefer works like the Gutenberg header, which is not my idea of an enjoyable read.)

I hope this helps the original poster, or spurs others to correct me where I have oversimplified or simply not recalled my theory as well as I should.

--
John V

From marcello at perathoner.de Tue Sep 16 13:10:27 2008
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 16 Sep 2008 22:10:27 +0200
Subject: [gutvol-d] [Fwd: research, unicity, compression ratios, ebooks]
In-Reply-To:
References: <48CF98B9.6040709@perathoner.de>
Message-ID: <48D012B3.7060607@perathoner.de>

John Vandenberg wrote:

> I assume this is looking at the size of the 'zip' file.

Yes. But with all shortcomings of this method you still get the result that

- religion and politics compress best and

- poetry compresses worst, about as well as pi.

> It is however easy to computationally assess a work to be "similar"
> to others.
>
> http://www.everything2.net/e2node/Using%2520gzip%2520to%2520do%2520computational%2520linguistics

Thanks for that link. It is very good to read that page until you get to:

> You puerile, tiny souls, listen again: Most naïve algorithms can
> accurately classify easy data sets, even one as naïve as yours. You
> can't just test on a toy domain of your choosing and declare
> yourselves smart, it doesn't work like that

Yes! Yes! Yes! He should have said that to Bowerbird, who insists that he can markup most of PG with his toy language. (And `proves' this outrageous assertion by marking up no less than 3 dime novels.)
Marcello

From Bowerbird at aol.com Tue Sep 16 14:29:25 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 16 Sep 2008 17:29:25 EDT
Subject: [gutvol-d] some feedback on dkretz's excellent "twisted" program (v0.d)
Message-ID:

here's some feedback for dkretz on twister-version(d), which is now available over at the google-page for it:
> http://code.google.com/p/dp50/downloads/list

overall, it's shaping up quite nicely... great job, don!

***

small bug...

sometimes the picture seems to "flutter". it stops if resized...

***

questions...

i've not found much use for thumbnails, but maybe some do?

is the "other" tab working yet?

and "pin page" means... what?

***

quite impressive...

the "hunter" tab routines have been coming along quite nicely. the ability to zoom along from one error to another is _handy_, which is -- of course -- what i've been telling people all along...

...however, the f3 button will not work for "next" on the mac...

the way the prev/next/replace buttons move is disconcerting.

***

the weakness...

the main weakness, however, is what should be the _strength_, namely the _interface_ and the ability to cruise through the book. in this regard, a number of tips are offered...

don't need "twister" at the top of the window; it's in the titlebar. then move the location-field up to the very top of the window.

move the left and right fields up, so as to increase their height. we want 'em big! move the tabs right under the location-field. you can also move the buttons at the bottom down a little bit... might seem like a small thing, but i can assure you that it's not.

you need to make the app responsive to left/right cursor-keys, since that's the major way that people prefer to change pages...

i would think that the "font" and "fontsize" buttons would apply to _all_ the text that gets loaded, but it doesn't seem to do so...
it only gets applied to text that's _selected_, which is a bit weird, since the logical end-point of this text is that it'll be _plain_ text, meaning that the font at one point in time is totally meaningless.

which also makes me wonder what bold/italic/underline mean, unless these styling choices will be transformed into _markup_. (bold and italic don't seem to be working, although underline is.)

***

all in all, though, none of the problems should be difficult to fix, and this tool is coming along very quickly. congratulations don!

-bowerbird

From schultzk at uni-trier.de Wed Sep 17 14:39:17 2008
From: schultzk at uni-trier.de (Schultz Keith J.)
Date: Wed, 17 Sep 2008 23:39:17 +0200
Subject: [gutvol-d] [Fwd: research, unicity, compression ratios, ebooks]
In-Reply-To: <48D012B3.7060607@perathoner.de>
References: <48CF98B9.6040709@perathoner.de> <48D012B3.7060607@perathoner.de>
Message-ID:

Hi Marcello,

It is quite natural that religious and political texts compress best. It is a matter of style and the nature of the text. But if you look at older texts you will find that they equally compress worst.

The best way to compress a text is to do it morphologically. It is also the best way of determining the entropy of a text and/or a language.

You really ought to learn to read and interpret quotes!!

What is meant here by "toy domain" is referring to corpus size or for that matter sample size. The author is not talking about a method or language.

regards
Keith.

On 16.09.2008 at 22:10, Marcello Perathoner wrote:

> John Vandenberg wrote:
>
>> I assume this is looking at the size of the 'zip' file.
>
> Yes.
> But with all shortcomings of this method you still get the
> result that
>
> - religion and politics compress best and
>
> - poetry compresses worst, about as well as pi.
>
>> It is however easy to computationally assess a work to be "similar"
>> to others.
>> http://www.everything2.net/e2node/Using%2520gzip%2520to%2520do%2520computational%2520linguistics
>
> Thanks for that link. It is very good to read that page until you
> get to:
>
>> You puerile, tiny souls, listen again: Most naïve algorithms can
>> accurately classify easy data sets, even one as naïve as yours. You
>> can't just test on a toy domain of your choosing and declare
>> yourselves smart, it doesn't work like that
>
> Yes! Yes! Yes! He should have said that to Bowerbird, who insists
> that he can markup most of PG with his toy language. (And `proves'
> this outrageous assertion by marking up no less than 3 dime novels.)
>
> Marcello

From dakretz at gmail.com Wed Sep 17 15:28:14 2008
From: dakretz at gmail.com (don kretz)
Date: Wed, 17 Sep 2008 15:28:14 -0700
Subject: [gutvol-d] Feedback....
Message-ID: <627d59b80809171528m31e2b1cr315bbf56c4d73101@mail.gmail.com>

Thanks for the review. As you imply, most of it is stuff I'm already aware of, but haven't gotten to yet.

The "Pin Page" button is to limit the work to the current page (intensive editing) rather than the entire book (extensive editing.)

I'm following the principle "Make it work, then make it pretty," and what I'm thinking through is the overall workflow. If you accept the assertion that the current dp process is suboptimal, then we can think about what needs improvement, why, and how to accomplish it.

I start with an established set of regexes - text patterns - that indicate a probable error.
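As a rough illustration of such a pattern list and a counting pass over it, here is a minimal Python sketch. The three rules and both function names are hypothetical, invented for this example; they are not taken from the tool under discussion, and a real list would be far longer.

```python
import re
from collections import Counter

# Hypothetical rules; a real list would be longer and tuned per project.
PATTERNS = {
    "letter o in a number": re.compile(r"\d,[oO]{3}\b"),
    "space before punctuation": re.compile(r"\s[.,;:?!]"),
    "tbe for the": re.compile(r"\btbe\b"),
}


def count_hits(pages):
    """Count matches of every pattern across all pages of a project."""
    counts = Counter()
    for page in pages:
        for name, pattern in PATTERNS.items():
            counts[name] += len(pattern.findall(page))
    return counts


def fix_o_in_numbers(page):
    """Blind repair for one rule: turn the letter o into a zero."""
    return PATTERNS["letter o in a number"].sub(
        lambda m: m.group(0).replace("o", "0").replace("O", "0"), page
    )
```

Sorting the counts from `count_hits` gives a quick view of which faults dominate a project, and a rule that is safe to apply without looking at the scan can get its own repair function, as in `fix_o_in_numbers`.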
The tool will, as a first pass, go through the entire pattern list and show how many instances of each pattern it can detect - it takes a minute or two to do so. From this, I can quickly see what the most egregious OCR faults are in this project.

Then I can pick one and quickly skip through and make corrections (using a likely fix if one suggests itself, as is usually the case.)

Usually, I will quickly start to notice other errors of the same and different types that I can fix at the same time.

Then, looking for feedback opportunities, I'll probably detect other error patterns that could be looked for in the project, and I'll add them to the bottom of the list.

One obvious measure of the effectiveness of this approach is the rate of error detection and correction. Especially since some errors can be repaired all at once without visual confirmation - as when the pattern "1,ooo" is seen, because the OCR frequently confuses ohs with zeroes.

To conceptualize this for a moment: A generally accepted principle of design as it applies to many kinds of systems is that, to encourage rapid improvement, feedback loops need to be implemented, and then jiggered to increase their quantity, quality, and rapidity of application.

So what I'm doing (and what happens in BBird's process as well, I imagine), is that I'm starting out with a set of existing feedback rules; I'm applying them extensively (throughout the document) and intensively (all roughly simultaneously), and I'm quickly discovering new feedback rules - which I then apply, and also add to my list of rules for future use.

The current dp process has (unintentionally) subverted almost all useful forms of feedback. I'm reminded of that repeatedly as I use this tool to patch up an OCR document.

-- There's no list of error patterns to check, even manually.
-- There's no automatic application of patterns, except for the very limited "wordcheck" feature.
-- Even that feature is only ever applied one page at a time.
-- So any possible extensive checking is unavailable.

I think it wouldn't be unrealistic to find that anyone making use of a similar extensive, intensive, feedback-based tool would find and correct, in less than an hour, as many errors as now are corrected in dp in two or three "Rounds" - which involves maybe 300 pages at 5 minutes per page x 2.5 avg rounds = 15 person-hours, extending over somewhere between a month and 18 months.

And even that process appears to have reached the limit of its scalability. There hasn't been an appreciable increase in throughput in what, maybe two or three years.

So maybe it would be useful to discuss alternate workflows making use of improved (and incrementally improvable) feedback mechanisms.

This is not to knock the dp model. It's provided a lot of pd text so far, and continues to chug along. But it wouldn't be reasonable to think that the system designers would have gotten it right the first time. Unfortunately, what now exists seems not easily amenable to substantive reconsideration, evaluation, or modification.

From dakretz at gmail.com Wed Sep 17 15:36:04 2008
From: dakretz at gmail.com (don kretz)
Date: Wed, 17 Sep 2008 15:36:04 -0700
Subject: [gutvol-d] Feedback...
Message-ID: <627d59b80809171536j68e36b65ta816fa2206a420d8@mail.gmail.com>

Er... make that - what - 45 person-hours?

From marcello at perathoner.de Wed Sep 17 16:10:52 2008
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu, 18 Sep 2008 01:10:52 +0200
Subject: [gutvol-d] [Fwd: research, unicity, compression ratios, ebooks]
In-Reply-To:
References: <48CF98B9.6040709@perathoner.de> <48D012B3.7060607@perathoner.de>
Message-ID: <48D18E7C.3040300@perathoner.de>

Schultz Keith J. wrote:

> You really ought to learn to read and interpret
> quotes!!
>
> What is meant here by "toy domain" is referring to
> corpus size or for that matter sample size. The author
> is not talking about a method or language.

Huh? And what did I say? I said that Bowerbird did mark up "no less than 3 dime novels". The "3 dime novels" are Bowerbird's "toy domain".

>>> You puerile, tiny souls, listen again: Most naïve algorithms can
>>> accurately classify easy data sets, even one as naïve as yours. You
>>> can't just test on a toy domain of your choosing and declare
>>> yourselves smart, it doesn't work like that
>>
>> Yes! Yes! Yes! He should have said that to Bowerbird, who insists that
>> he can markup most of PG with his toy language. (And `proves' this
>> outrageous assertion by marking up no less than 3 dime novels.)

From ebooks at ibiblio.org Thu Sep 18 13:56:13 2008
From: ebooks at ibiblio.org (Jose Menendez)
Date: Thu, 18 Sep 2008 16:56:13 -0400 (EDT)
Subject: [gutvol-d] PG's "Peter Pan in Kensington Gardens"
Message-ID:

On Sept. 12, 2008, Bowerbird wrote:

> ok, so we've got two digitizations, and a clean-up by jose menendez, for
> "peter pan".
>
> meaning i can execute my comparison method and see how well it does on
> this book.
>
> first pass says that less than 10% of the ~3,000 non-blank lines from
> this book _differ_
> between the two digitizations...
>
> that's just a tease for the week-end. i'll report the full results on
> monday...
>
> have a nice weekend...

Well, it's been nearly a week since Bowerbird posted that message. I'm surprised that he's running late with those full results. I was looking forward to seeing whether he'd catch the errors in the old Scribner's edition that I left uncorrected in my HTML and PDF versions in order to test his error detecting skills.

For example, there's this line on page 13:

from the sheeps' shoulders and they look

That "sheeps'" should obviously be "sheep's."
For another example, there are these two lines on page 17:

our gate are the Dog's Cemetery and the

what the Dog's Cemetery is, as Porthos is

Both of those "Dog's" should be "Dogs'."

I'm sure Bowerbird would have reported those errors when he eventually got around to posting the full results of his comparison. :)

Jose Menendez

From Bowerbird at aol.com Thu Sep 18 14:10:39 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Sep 2008 17:10:39 EDT
Subject: [gutvol-d] PG's "Peter Pan in Kensington Gardens"
Message-ID:

thanks for giving people notice about those errors, jose.

but i still have to count this post as "noise", not "signal", because, well... what's your point? you don't have one...

other than that i'm "running late"...

but what's the big deal about that? no one here cares anyway, do they?

anyway, not more than 15 minutes before your post, i uploaded this little baby, for anyone who does care:
> http://z-m-l.com/ppakg/ppakg-diff01.html

i compared two different digitizations of this book...

the lines that _matched_ are shown in _black_... as you can see, it's the vast majority of the book.

lines that _differed_ are shown in red and blue... i display the scan so you can see which is correct, and check the checkbox by the one that's correct. if neither is correct, you'd check the bottom box...

the form isn't active, so your "votes" aren't counted, so don't actually bother to mark those boxes now... i was just showing people how the thing will work...

-bowerbird

p.s. jose's record so far: noise=2, signal=1. c'mon jose.
From ebooks at ibiblio.org Thu Sep 18 14:59:20 2008
From: ebooks at ibiblio.org (Jose Menendez)
Date: Thu, 18 Sep 2008 17:59:20 -0400 (EDT)
Subject: [gutvol-d] PG's "Peter Pan in Kensington Gardens"
In-Reply-To:
References:
Message-ID:

Bowerbird wrote:

> thanks for giving people notice about those errors, jose.
>
> but i still have to count this post as "noise", not "signal",
> because, well... what's your point? you don't have one...
>
> other than that i'm "running late"...
>
> but what's the big deal about that?
> no one here cares anyway, do they?

I posted that message because it occurred to me that if you did find those errors from the original book, you'd probably crow about how you found errors that I'd missed. I doubt you would have believed me then if I'd said that I left them in on purpose. :)

> anyway, not more than 15 minutes before your post,
> i uploaded this little baby, for anyone who does care:
>> http://z-m-l.com/ppakg/ppakg-diff01.html
>
> i compared two different digitizations of this book...

The two Scribner's editions that I linked to at the Internet Archive?

> the lines that _matched_ are shown in _black_...
> as you can see, it's the vast majority of the book.
>
> lines that _differed_ are shown in red and blue...
> i display the scan so you can see which is correct,
> and check the checkbox by the one that's correct.
> if neither is correct, you'd check the bottom box...
>
> the form isn't active, so your "votes" aren't counted,
> so don't actually bother to mark those boxes now...
> i was just showing people how the thing will work...

What I find more interesting is the index of the directory that file is in:

http://z-m-l.com/ppakg/

Did you intentionally put all the illustrations in the wrong places, or did you just goof up?
Take a look at this scan at the Internet Archive of the list of illustrations: http://ia300011.us.archive.org/zipview.php?zip=/2/items/peterpaninkensin00barr/peterpaninkensin00barr_flippy.zip&file=0015.jpg Note that the illustrations follow even-numbered pages, e.g. 2, 16, 24, 28, etc. You have them following the next odd-numbered pages, e.g. 3, 17, 25, 29, etc. You also have the table of contents scan on an even-numbered page (it should be odd-numbered) facing the list of illustrations scan. http://z-m-l.com/ppakg/ppakgf002.jpg http://z-m-l.com/ppakg/ppakgf003.jpg There may be more errors, but I don't have time to check now. Jose Menendez > p.s. jose's record so far: noise=2, signal=1. c'mon jose. P.S. Apparently, Bowerbird thinks that I care what he thinks of my posts. I don't. :) From Bowerbird at aol.com Thu Sep 18 15:42:52 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 18 Sep 2008 18:42:52 EDT Subject: [gutvol-d] PG's "Peter Pan in Kensington Gardens" Message-ID: jose's record: noise=3, signal=1 *** jose said: > I posted that message because it occurred to me that > if you did find those errors from the original book, > you'd probably crow about how you found errors > that I'd missed. I doubt you would have believed me > then if I'd said that I left them in on purpose. :) who cares about silly stuff like that? i care about correcting the errors... of course i gave jon noring a bad time about it, because he was very busy giving p.g. a bad time, talking nonsense and bullshit and spin and lies. but he _never_ produced any _lists_ of p.g. errors. and meanwhile, he couldn't find simple errors in his own work, so his hypocrisy was just ridiculous. on the other hand, you've proven you can find errors. and although you often engage in bullshit just to bug michael, you occasionally post a worthwhile message. it would be counterproductive for me to try to rile you. (besides, it seems i don't need to do anything to do it.) 
> The two Scribner's editions that I linked to at the Internet Archive? i can't remember. why do you care? and, if you care enough, track it down. > What I find more interesting of course you don't want to talk about what's _really_ important. you just want to get digs in. do you think people care about that? why don't you talk about something that's _meaningful_ instead? you've got a good head on you, but you insist on being _petty_... > What I find more interesting > is the index of the directory that file is in: > http://z-m-l.com/ppakg/ notice anything different about that directory? maybe not... it's not under the "go" directory, like most of my other books. that's because it's not ready for criticism yet, it's in transition... > Did you intentionally put all the illustrations in the wrong places, > or did you just goof up? neither. i simply didn't care where they went. it's a work-product, meant to correct the _text_, so illustrations have zero importance... the only reason i included them was to avoid the work of culling 'em. > Note that the illustrations follow even-numbered pages, > e.g. 2, 16, 24, 28, etc. > You have them following the next odd-numbered pages, > e.g. 3, 17, 25, 29, etc. who cares? at any rate, i'm gonna rework those illustrations anyway... there's no sense in having each illustration use up 4 pages in an e-book, with 2 of the pages being blank back-sides... so i'll just put the caption across from the illustration itself. when i'm done, you can bet the recto/verso will be correct... > You also have the table of contents scan on an even-numbered page > (it should be odd-numbered) facing the list of illustrations scan. > http://z-m-l.com/ppakg/ppakgf002.jpg > http://z-m-l.com/ppakg/ppakgf003.jpg that i did on purpose... i usually rearrange forward-matter pages, and i always put a table-of-contents page on page c002 and f002. sometimes it'll be one that i create, sometimes one from the book. 
(if the one in the book doesn't have a roman-numeral pagenumber, i'll use it in the 002 page-slots. if it does, i'll leave it where it was, and create a page. the idea is to have a contents-page in a known location that's also extremely convenient, just one page away from the titlepage.)

> There may be more errors, but I don't have time to check now.

i would suggest that you focus on the important stuff...

jose's record: noise=3, signal=1

-bowerbird

From Bowerbird at aol.com Fri Sep 19 13:14:48 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 19 Sep 2008 16:14:48 EDT
Subject: [gutvol-d] peter pan at kensington gardens -- 001
Message-ID:

ok, well, i guess jose wants me to get started on this, so let's go...

for several years now, i've argued that the best way to correct o.c.r. is to compare two different digitizations of the same book, where you assume the lines that are the _same_ in both digitizations are _correct_, and focus in on the lines that are _different_ to determine which of the two is incorrect (or if both are), and then you fix _that_.

this methodology -- if it works -- is far more efficient than the "old" method (which d.p. uses) of comparing every o.c.r. word to the scan, because you only examine a few of the lines (typically, about 5-8%).

i've already documented, with solid research, that this method does work, and works well, without bad side-effects. but let's try it again. maybe if we demonstrate it enough times, d.p. will get the hint...

(that's a joke, folks. the "powers that be" at d.p. are brain-dead.)

***

i'm comparing the o.c.r. output from archive.org on two copies of "peter pan at kensington gardens".
the two digitizations look to be of the exact same version, or very close, as the linebreaks match up.

now, you might expect that -- given that the books are the same -- the o.c.r. output would be the same as well, at least for the most part. that is, any o.c.r. errors in one book would be mirrored in the other...

o.c.a. uses the same scanning machines at different libraries, and the same o.c.r. program, so such an expectation is quite understandable.

so the first test is whether this expectation holds up.

if it does, then we might have a problem where a specific line from both digitizations is _incorrect_, and _incorrect_in_the_same_way_. this means the lines would _match_ -- so we wouldn't examine 'em, which means that we would miss the error. this would be a problem. and if it happens on _lots_ of lines in the book, it'd be a big problem.

my prior experiments have shown that this problem does not emerge, not on anything more than a trivial basis. but let's crank it once again.

of course, once we have done the comparison, we need a _clean_text_ to use as our criterion. in this case, we've got the text that jose made. jose is very good at turning out an extremely clean text.

so let's do it...

we'll correct the two texts against each other, then compare to jose's, and see how many errors our comparison method would have missed.

-bowerbird

From Bowerbird at aol.com Mon Sep 22 14:31:42 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 22 Sep 2008 17:31:42 EDT
Subject: [gutvol-d] "banana cream" is available
Message-ID:

happy autumn!

"banana cream" -- one of my proofing programs -- is available...
i'm still applying a little bit of polish, so it's not posted online yet, but if anyone is itching to see it, just shoot me a backchannel and i'll route a review copy to you.

-bowerbird

From marcello at perathoner.de Mon Sep 22 16:32:44 2008
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 23 Sep 2008 01:32:44 +0200
Subject: [gutvol-d] "banana cream" is available
In-Reply-To:
References:
Message-ID: <48D82B1C.3030601@perathoner.de>

Bowerbird at aol.com wrote:

> "banana cream" -- one of my proofing programs -- is available...
>
> i'm still applying a little bit of polish,

Banana cream with polish? Urgh! Must taste like pudding with proof.

From Bowerbird at aol.com Tue Sep 23 16:37:06 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Sep 2008 19:37:06 EDT
Subject: [gutvol-d] peter pan in kensington gardens -- 002
Message-ID:

we're resolving two digitizations of "peter pan in kensington gardens", to see how the resultant file will compare with jose's text, our criterion.
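The comparison at the heart of this thread can be sketched in a few lines of Python. This is only an illustration, not bowerbird's program: the function name is made up, and it assumes the two digitizations already break lines in the same places, as is said of these two copies.

```python
def differing_lines(text_a, text_b):
    """Return (line number, line from a, line from b) wherever the
    non-blank lines of two digitizations disagree.

    Assumes both texts have the same linebreaks; anything else needs
    an alignment step first.
    """
    lines_a = [ln.rstrip() for ln in text_a.splitlines() if ln.strip()]
    lines_b = [ln.rstrip() for ln in text_b.splitlines() if ln.strip()]
    if len(lines_a) != len(lines_b):
        raise ValueError("line counts differ; align the texts first")
    return [
        (n, a, b)
        for n, (a, b) in enumerate(zip(lines_a, lines_b), start=1)
        if a != b
    ]
```

Only the returned lines need to be checked against the scan. Lines that match in both copies are taken on trust, so an error that both copies share would slip through, which is the risk the thread sets out to measure.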
here are the changes i made _automatically_, without checking each:

first, close up floating punctuation:
> s=replaceall(s," .",".")
> s=replaceall(s," ,",",")
> s=replaceall(s," ;",";")
> s=replaceall(s," :",":")
> s=replaceall(s," ?","?")
> s=replaceall(s," !","!")
> s=replaceall(s,chr(10)+"' ",chr(10)+"'")

next, remove garbage characters:
> s=replaceall(s,"<","'")

next, correct mistaken quotemarks at the beginning of a line:
> s=replaceall(s,chr(10)+"1 ",chr(10)+"'")
> s=replaceall(s,chr(10)+"4 ",chr(10)+"'")
> s=replaceall(s,chr(10)+"6 ",chr(10)+"'")

there are probably more such global changes i'll do, after having examined the diffs between the files, but that's a good start to kick the process off...

-bowerbird

From Bowerbird at aol.com Tue Sep 23 17:31:04 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Sep 2008 20:31:04 EDT
Subject: [gutvol-d] peter pan in kensington gardens -- 003
Message-ID:

we're resolving two digitizations of "peter pan in kensington gardens", to see how the resultant file will compare with jose's text, our criterion.

as i pointed out earlier, here are the diffs between the two digitizations:
> http://z-m-l.com/ppakg/ppakg-diff01.html

with 206 diffs in the body of the text -- on some 3,000 non-blank lines -- you can see how this method gives us _much_ greater proofing efficiency...

i said:
> there are probably more such global changes i will do,
> after having examined the diffs between the files,
> but that's a good start for now...

we see that a check for _numbers_ will be very productive, since many quotemarks were misrecognized as numbers...

we also find a lot of spacey-quotes.
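For readers who want to try the same clean-up outside bowerbird's own tool, here is a hedged Python port of the replaceall steps posted earlier, plus a hypothetical helper for the number check. Both function names are invented here. One difference from the original: the chr(10)-based replacements never touch the first line of the file, while the `^` anchor used below does.

```python
import re

FLOATING = re.compile(r" ([.,;:?!])")           # space before punctuation
LEADING_SPACEY_QUOTE = re.compile(r"^' ", re.MULTILINE)
LEADING_DIGIT = re.compile(r"^[146] ", re.MULTILINE)


def auto_clean(text):
    """Apply the automatic changes in the order they were posted."""
    text = FLOATING.sub(r"\1", text)            # close up floating punctuation
    text = LEADING_SPACEY_QUOTE.sub("'", text)  # close up a spacey opening quote
    text = text.replace("<", "'")               # garbage character to quotemark
    text = LEADING_DIGIT.sub("'", text)         # 1, 4 or 6 misread for a quotemark
    return text


def lines_starting_with_digit(text):
    """List (line number, line) for lines that open with digits and a space,
    as candidates for a misrecognized quotemark."""
    return [
        (n, ln)
        for n, ln in enumerate(text.splitlines(), start=1)
        if re.match(r"\d+ ", ln)
    ]
```

The second function only flags lines for review rather than changing them, since a line can legitimately begin with a number.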
this book, published in england, uses single-quote marks for dialog, which is a bit more convoluted to correct than double-quote marks. but it's still quite doable, so i'll fix those and then re-do the diffs.

-bowerbird

From Bowerbird at aol.com Mon Sep 29 01:45:28 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 29 Sep 2008 04:45:28 EDT
Subject: [gutvol-d] "jean of the lazy a" -- 001
Message-ID:

ok, the next step in the chain here is to manage the process of mashing up project gutenberg e-texts and archive.org e-texts.

archive.org has the scan-sets that will back the digital text, and project gutenberg has fairly clean text that can be used to proof o.c.r. from archive.org to bring it up to a very nice state of being.

so once again i focus on a p.g. e-text that was recently "reposted".

this one is by b.m. bower, #538, and it's titled "jean of the lazy a".
> http://www.gutenberg.org/files/538/538.txt

meanwhile, at archive.org, it's known as "jeanoflazy00boweiala":
> http://www.archive.org/details/jeanoflazy00boweiala

***

our first step will be to reconcile the paragraphing between them.

i found 5 errors in the paragraphing of the reposted p.g. e-text...

***

these 2 "paragraphs" were wrongly introduced in the p.g. e-text:

> "Don't -- oh, it looks as if you were picking up a
> http://z-m-l.com/go/jeana/jeanap136.jpg

> She registered another sob which the camera
> http://z-m-l.com/go/jeana/jeanap136.jpg

***

and these 3 paragraphs were incorrectly missed in the p.g.
e-text:

> Jean was startled, but she did not lower her gun
> http://z-m-l.com/go/jeana/jeanap058.jpg

> She stopped, and Burns turned his eyes involuntarily
> http://z-m-l.com/go/jeana/jeanap214.jpg

> Jean began to feel a certain confusion.
> http://z-m-l.com/go/jeana/jeanap284.jpg

***

coordinating the paragraphs like this facilitates the _re-introduction_ of the p-book linebreaks in the archive.org o.c.r. back into the p.g. e-text, which makes a comparison of the two digitizations much simpler to do...

-bowerbird

From Bowerbird at aol.com Mon Sep 29 13:38:36 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 29 Sep 2008 16:38:36 EDT
Subject: [gutvol-d] "jean of the lazy a" -- 002
Message-ID:

we're resolving the p.g./archive.org differences in "jean of the lazy a".
> http://www.gutenberg.org/files/538/538.txt
> http://www.archive.org/details/jeanoflazy00boweiala

***

i've already noted the 5 paragraphing errors that occur in the "reposted" version of this e-text. but there were a few others.

***

in the "reposted" e-text, this line -- which had been inadvertently dropped from the e-text as first posted -- was correctly restored:
> You're going to tell me I'm in bad. But I can't
> http://z-m-l.com/go/jeana/jeanap221.jpg

however, as you can see by looking at the scan, the single-quotes around "in bad" were not restored.

it's ironic that a relatively high percentage of "applied corrections" involve an error in themselves, but anyone who has done this type of work knows that it's the case, since it happens to all of us... we're human, but we must carry on...
***

in addition, this error was missed in the reposted version:
> Was it pose Was the girl phlegmatic,--with that face which was so alive
> Was it pose? Was the girl phlegmatic,--with that face which was so alive
> http://z-m-l.com/go/jeana/jeanap247.jpg

***

finally, as we have come to expect now, text _styling_ (e.g., italics) was
_not_ repaired in this reposted text, nor was the frontispiece included.

***

these are the errors which have come to my attention in the process
of prepping the files for the actual _comparison_, which might reveal
even more errors in the p.g. e-text. but the point is not those errors
per se... the point is that a _comparison_ of two digitizations can
help find errors, and -- more specifically -- how clean p.g. e-texts
can be used to correct o.c.r. errors in e-texts from archive.org.

-bowerbird

From Bowerbird at aol.com Tue Sep 30 02:07:33 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 30 Sep 2008 05:07:33 EDT
Subject: [gutvol-d] "jean of the lazy a" -- 003
Message-ID:

we're resolving the p.g./archive.org differences in "jean of the lazy a".
> http://www.gutenberg.org/files/538/538.txt
> http://www.archive.org/details/jeanoflazy00boweiala

***

i have listed the _errors_ that persisted in the "reposted" e-text so i
should also acknowledge the _fixes_ done during the "reposting".

> spacey ellipses were closed up
> em-dashes were "clothed"
> end-line hyphenates were rejoined
> these incorrect hyphenates were fixed:
> > work-aday
> > alto-gether
> > bur-rowing
> > thou-sands
> > conver-sation
> a he/be error was fixed:
> "You better not," be warned.
> "You better not," he warned.
> an incorrect comma was deleted:
> the,
> the
> a spacey quotemark was closed up:
> in? "
> in?"
> 3 paragraph-terminating periods were inserted:
> room(.)
> picture(.)
> questioning(.)

so it's not as if they aren't doing good work on these "repostings",
it's just that they're not getting out all of the glitches that they can.

***

all in all, as we have documented many times, another
_very_early_e-text_ -- this one numbered #538 --
proves to have been remarkably well-done.

if we disregard the "bureaucratic" changes -- like "clothing" em-dashes --
and we assume that the comparison won't turn up much more than this,
there were fewer than 32 errors in this 320-page book, which means that it
had already attained -- in 1997 -- my oft-repeated standard for a book to
go to the public for "continuous proofing", i.e., 1-error-every-10-pages.

the people who did these early books should be praised for their quality...

-bowerbird

From Bowerbird at aol.com Tue Sep 30 10:04:54 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 30 Sep 2008 13:04:54 EDT
Subject: [gutvol-d] "jean of the lazy a" -- 004
Message-ID:

we're resolving the p.g./archive.org differences in "jean of the lazy a".
> http://www.gutenberg.org/files/538/538.txt
> http://www.archive.org/details/jeanoflazy00boweiala

***

the next step is to rebreak the p.g. e-text, restoring the linebreaks
from the original p-book, so that the digital text and the scan-set
will be in sync, and can therefore attain the most possible synergy.
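
for illustration only, here is a minimal sketch of what such a rebreaking step might look like. it is not the tool used for this book. it assumes that the p.g. paragraph and the o.c.r. lines tokenize into the same words, and that an o.c.r. line ending in a hyphen carries a fragment whose remainder opens the next line. the function name `rebreak` and the sample lines are made up for the example.

```python
def rebreak(clean_text, ocr_lines):
    """Re-flow clean_text so each line holds the words of the matching o.c.r. line.

    Assumption: both sources split into the same words, apart from words
    that the o.c.r. shows hyphenated across a line end.
    """
    words = clean_text.split()
    rebroken = []
    pos = 0
    carried = 0  # 1 when the previous o.c.r. line ended in a hyphenated fragment
    for ocr_line in ocr_lines:
        tokens = ocr_line.split()
        # the first token of this line was already emitted as part of the
        # whole word on the previous line, so do not count it again
        take = max(len(tokens) - carried, 0)
        carried = 1 if tokens and tokens[-1].endswith("-") else 0
        rebroken.append(" ".join(words[pos:pos + take]))
        pos += take
    if pos != len(words):
        raise ValueError("word counts disagree; paragraph needs manual review")
    return rebroken


# made-up sample, not taken from the book
ocr = ["the quick brown fox jumps over the lazy dog and keeps run-",
       "ning toward the barn."]
clean = "the quick brown fox jumps over the lazy dog and keeps running toward the barn."
for line in rebreak(clean, ocr):
    print(line)
```

in this sketch the rejoined word stays whole on the earlier line, so the end-line hyphens still have to be restored in a later pass.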

having done this, i can now also post a "first draft" of the book:
> http://z-m-l.com/go/jeana/jeanap123.html

this is still in-progress, because i have to restore the hyphens to the
end-line-hyphenates created by the rebreaking process, but it's a good start.

after this, i will do the actual _comparison_ between the p.g. text
and the o.c.r., to see if i can turn up more errors in the p.g. text...

but perhaps now that i have posted the book,
jose menendez will do that grunt-work for me... :+)

***

i make all of this sound very easy. and -- eventually -- it will be.
but i am having to do quite a bit of work to design and code
the tools that will make it a simple process. but hey,
that's why they pay me the big bucks... ;+)

***

at any rate, here are 2 more "reposting" errors i stumbled across:
> CHAPTER IV.
> CHAPTER IV
> http://z-m-l.com/go/jeana/jeanap042.html

> PUNCH VERSES PRESTIGE
> PUNCH VERSUS PRESTIGE
> http://z-m-l.com/go/jeana/jeanap177.html

-bowerbird
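
as a rough sketch of the _comparison_ step mentioned in the message above, the two digitizations could be diffed word by word with python's standard `difflib` module. this is an assumption about one workable approach, not the method actually used here; the function name `word_diffs` is hypothetical. the sample pair is the chapter-title error listed above.

```python
import difflib


def word_diffs(pg_text, ocr_text):
    """Return (p.g. words, o.c.r. words) pairs wherever the two texts disagree."""
    pg_words = pg_text.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(None, pg_words, ocr_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # each pair is only a candidate; the scan decides which side is right
            diffs.append((" ".join(pg_words[i1:i2]), " ".join(ocr_words[j1:j2])))
    return diffs


print(word_diffs("PUNCH VERSES PRESTIGE", "PUNCH VERSUS PRESTIGE"))
```

a difference only flags a spot to check against the page scan, since either digitization can be the one in error.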