From ag737 at freenet.carleton.ca Tue Jan 1 11:45:20 2008 From: ag737 at freenet.carleton.ca (Wallace J.McLean) Date: Tue, 01 Jan 2008 14:45:20 -0500 Subject: [gutvol-d] Happy Public Domain Day! Message-ID: <1bba7e1ba8e4.1ba8e41bba7e@ncf.ca> http://www.copyrightwatch.ca/?p=49 From Bowerbird at aol.com Wed Jan 2 19:29:14 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Jan 2008 22:29:14 EST Subject: [gutvol-d] happy Message-ID: please. happy new year. thank you. -bowerbird ************************************** See AOL's top rated recipes (http://food.aol.com/top-rated-recipes?NCID=aoltop00030000000004) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080102/fd364031/attachment.htm From Bowerbird at aol.com Fri Jan 4 10:27:12 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jan 2008 13:27:12 EST Subject: [gutvol-d] moby dick -- a report on the state of the art of digitization Message-ID: ok, i've told you recently that o.c.r. from the o.c.a. is good. and that it can be improved by post-o.c.r. clean-up programs. and then improved even further by comparison with an existing digitization, if one should exist, to the point it can be _finished_, quickly, even largely automatically. *** here's a report in support of that... i examined the o.c.a. first volume of "moby dick": > http://www.archive.org/details/mobydickorwhale01melvuoft i did an initial comparison of their o.c.r. with the e-text from project gutenberg: > http://www.gutenberg.org/etext/2489 it didn't take long to determine that the p.g. e-text was from a different edition than the one which the o.c.a. scanned... the first tip-off was that the o.c.a. edition used british spellings, not the american ones which are there in the p.g. e-text. this brings up a good point to consider when we're talking about _comparison_ as a strategy for correcting o.c.r. text... 
specifically, there are a number of things that will cause superfluous "differences" that need to be ignored in comparisons, like american/british spelling variants... you don't want these differences flagged. also -- as is typical with british editions -- there was also a difference in quotemarks; the british use single-quotemarks as default, and nested quotes use double-quotemarks. american editions use double-quotemarks as the default, of course, with any internal quotes signified by single-quotemarks... in addition, one of the differences that you will frequently find between editions involves punctuation (especially colons and semicolons, as well as various takes on hyphenated words), and this is particularly true when one edition is from a british publisher, the other an american. there were other superficial differences between these two texts. as a quick list: 1. chapter numbers (roman versus arabic) 2. headings (all-upper versus mixed-case) 3. chapter initial capped words (versus not) 4. block indents (i.e., the o.c.a. text had none) finally, one of the biggest complicating factors on comparing these e-texts is due to a massive _incompetence_ in the o.c.a. workflow, namely that they lose all the em-dashes in their text... that's right, you heard me correctly. they lose all the em-dashes in their o.c.r. text. some stupid person somewhere has evidently mis-set some toggle, discarding em-dashes... a glitch this big is ridiculously unforgivable... my mind is just boggled that they could even _make_ such a stupid mistake. but they did... even worse, i have tried -- tried repeatedly -- to bring it to their attention. yet it persists... this is just plain frustrating. and i've decided that i will make it one of my missions in 2008 to get them to fix this glitch. wish me luck, eh? in the meantime, though, we've gotta accept it, and move on with our mission. so, when you look at the results i will show you, you need to keep in mind the following gotchas: 1. 
i have removed all the dashes from the results. 2. i have deleted quotemarks from the results too. this means, of course, that this report will slightly _under-estimate_ the number of errors present, since none of the errors involving quotemarks or em-dashes will be detected by the analysis here... another frustration with the o.c.a. workflow is that they lose the pagebreaks from their o.c.r., which means that our first task is restoring those. once again, this is a _stupid_human_decision_, and it reflects _extremely_poorly_ on the o.c.a. in comparison, however, google is even worse. google routinely loses not just pagebreaks and em-dashes, but single- and double-quotemarks, and the hyphens from end-line hyphenates too. it's hard to imagine such mind-blowing stupidity manifested by one of the richest businesses in the world, ordinarily not _nearly_ so dumb about text. but there it is, in black-and-white, for all to see... i mean, _seriously_, doesn't anyone at these big digitization projects even _look_ at their output? these projects have scanned _millions_ of books -- _literally_millions_ -- yet they remain oblivious to a problem that reveals itself within 5 minutes! it's sad. and not just plain sad, it's tragically sad... *** the first thing i had to do was fix the linebreaks in the p.g. e-text so that they would conform to the linebreaks in the o.c.r. text, to be compared... after that, i went on to the next task, involving -- wait a minute, did you just read over that and accept it, as some kind of simple, routine task? if so, think again. restoring those linebreaks was a difficult task, one that took far too much time. sure, i wrote a program that did most of it, but _that_ took some time. and cleanup took more. and -- since there was no good reason for p.g. to rewrap the lines in the first place -- the time that it took to restore the original linebreaks was just wasted time. 
i did it, because i had to, in order to do this experiment, but still, it was a waste of time. more on rewrapping later... anyway... this first volume of moby dick runs right at 600k. so even though it's only one-half of "moby dick", by itself it would constitute a relatively large book. moreover, it consists of 11,411 (non-blank) lines. keep that number in mind as i discuss my results. my post-o.c.r. clean-up program is "in progress", as i continue to improve it in an iterative process, running it and then comparing outputs, and then improving and re-running it for more comparison. this means that the numbers are kind of "spongy", in the sense that i could keep improving the app till there are virtually _no_ differences remaining between the output it gives and a "criterion" text. but given a fair amount of "twiddling" for this book, i came up with roughly _444_ lines which _differed_ between these versions of volume 1 of moby dick... that's roughly 4% of the 11,411 lines, a percentage which is very close to what i've gotten in other tests... however, you should keep in mind that some of the differences between these two editions were due to the fact that they _are_ separate editions, meaning that some of the 444 differences are accounted for by _edits_ that were made in the (later) p.g. edition. further, the p.g. text had some errors in it as well, which cause differences between that text and the o.c.r. output that o.c.a. obtained from their edition. i estimate that fully _half_ of the 444 different lines were _not_ due to errors in the o.c.r., which means that only _2%_ of the 11,411 lines reflect o.c.r. errors. this figure is also consistent with my previous results. this means that we could proof these o.c.r. results to a very high standard of accuracy by simply examining the 222 lines that were different from the p.g. e-text. 
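A minimal sketch of the comparison described above, in Python. It rests on several assumptions: the two texts already share identical linebreaks, the spelling table is a tiny made-up stand-in for a real one, and the sample lines are invented for illustration rather than taken from either e-text. None of the names below come from the post. This is not the clean-up program itself, only one way the "ignore superfluous differences, then count the lines that still differ" idea might look.

```python
import re

# Hypothetical, deliberately tiny British -> American table (an assumption).
BRITISH_TO_AMERICAN = {"colour": "color", "harbour": "harbor", "grey": "gray"}

def _americanize(match):
    """Swap a British spelling for its American form, keeping an initial capital."""
    word = match.group(0)
    american = BRITISH_TO_AMERICAN.get(word.lower())
    if american is None:
        return word
    return american.capitalize() if word[0].isupper() else american

def normalize(line):
    """Remove the differences the comparison is meant to ignore."""
    # Drop single and double quotemarks, straight or curly (apostrophes go too).
    line = re.sub(r"[\"'\u2018\u2019\u201c\u201d]", "", line)
    # Replace em/en dashes and runs of two or more hyphens with a space.
    line = re.sub(r"--+|[\u2013\u2014]", " ", line)
    # Harmonize spelling variants word by word.
    line = re.sub(r"[A-Za-z]+", _americanize, line)
    # Collapse the whitespace left behind by the removals.
    return " ".join(line.split())

def differing_lines(ocr_lines, ref_lines):
    """Return (line_number, ocr_line, ref_line) for aligned lines that still differ."""
    return [
        (number, ocr, ref)
        for number, (ocr, ref) in enumerate(zip(ocr_lines, ref_lines), start=1)
        if normalize(ocr) != normalize(ref)
    ]

def percent(count, total):
    """Share of lines, as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)

# Invented sample lines: one spelling variant, one real o.c.r. slip, one quote style.
ocr = ["The colour of the whale", "Call me Ishmae1.", "'Aye, aye!' said he"]
ref = ["The color of the whale", "Call me Ishmael.", "\"Aye, aye!\" said he"]

for number, ocr_line, ref_line in differing_lines(ocr, ref):
    print(number, repr(ocr_line), repr(ref_line))

# The line counts quoted in the report.
print(percent(444, 11411), percent(222, 11411))
```

Only the second sample line is flagged, because the spelling and quotemark differences are normalized away first. The last line redoes the report's arithmetic: 444 and 222 differing lines out of 11,411 come to roughly 4% and 2%.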
thus far, i haven't even looked at the scans themselves, so i cannot give estimates of how many of the 222 lines were _actually_ o.c.r. errors. some of them were likely errors in the original book, which the o.c.r. recognized _correctly_, and thus will not count against its accuracy. plus there are those cases where so-called "o.c.r. errors" should really be attributed to the human operators and the workflow that they have created around the process. (an example of this is the garbage characters that o.c.r. throws when it encounters pencil-marks in the margins; a proper workflow "standardizes" the scans by cropping, so the o.c.r. has a "bounding box" around the text-block, and won't even extend recognition out into the margins. it's unfair to blame o.c.r. for _our_ workflow deficiencies.) when all is said and done, i expect that we will see about 111 lines with o.c.r. errors, or 1% of the 11,411 total lines. if we have tools that can focus us in on those 111 lines, it's obvious that we can proof our books _much_ faster than the look-at-every-word-on-every-page method... and this means that the combination of good o.c.a. o.c.r. and an aggressive post-o.c.r. clean-up program can create text that is phenomenally accurate, with little human help. considering the millions of scan-sets we need to digitize, this is good news indeed. these results confirm those i've obtained on 2 books earlier: > http://www.pgdp.net/phpBB2/viewtopic.php?t=24008 *** next week, i'll share extensive results from this test. i'll show you the global changes that were made by my post-o.c.r. clean-up application-tool. i'll show you the files that ended up being compared. i'll show you the lines that differed between the files. i'll show you how i categorized these different lines. (some were edits, some were errors in the p.g. e-text, and some were punctuation differences; the rest were the lines the o.c.r. _probably_ recognized incorrectly.) 
if i've gotten around to it, i'll let you know the results of my check of these differences against the scans... finally, i'll show you the lines that _might_ have had "stealth scannos" on them -- lines which _might_ be a _problem_ with the comparison method of proofing, were they found to be a relatively common occurrence. *** for the weekend, however, chew on this little thought: of the 11,411 lines in this e-text, some 11,000 of them were digitized correctly, by the combination of the good o.c.r. from the o.c.a. followed by my post-o.c.r. clean-up. -bowerbird From Bowerbird at aol.com Fri Jan 4 14:27:36 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jan 2008 17:27:36 EST Subject: [gutvol-d] chinese versus english Message-ID: please. in today's "posted" digest, the chinese versus english race is tight, with chinese having 9 e-texts posted, and english surging with 10. portuguese and esperanto had 1 each, to fill out the pack... thank you. -bowerbird From julio.reis at tintazul.com.pt Sat Jan 5 12:14:04 2008 From: julio.reis at tintazul.com.pt (Júlio Reis) Date: Sat, 05 Jan 2008 20:14:04 +0000 Subject: [gutvol-d] chinese versus english In-Reply-To: References: Message-ID: <1199564044.7678.74.camel@abetarda> Speaking of Chinese e-texts... does anyone have a clue as to how these texts are produced?
Have Chinese-speakers some web site like DP? Is it the work of people going solo? Or are these already available as e-texts in some other site? Tintazul. From sly at victoria.tc.ca Sat Jan 5 12:25:28 2008 From: sly at victoria.tc.ca (Andrew Sly) Date: Sat, 5 Jan 2008 12:25:28 -0800 (PST) Subject: [gutvol-d] chinese versus english In-Reply-To: <1199564044.7678.74.camel@abetarda> References: <1199564044.7678.74.camel@abetarda> Message-ID: On Sat, 5 Jan 2008, Júlio Reis wrote: > Speaking of Chinese e-texts... does anyone have a clue as to how these texts are produced? Have Chinese-speakers some web site like DP? Is it the work of people going solo? Or are these already available as e-texts in some other site? > > Tintazul. Interesting question. In case it helps, when I went back through the posted list, checking the recent Chinese texts, I was expecting to see one or two people who were submitting them. Instead, after checking about 20 items, I only saw an email address duplicated once. So it appears these are being submitted by many different people. Since this is happening all at the same time, it would be natural to assume that there is _some_ kind of organization behind it... Andrew From vze3rknp at verizon.net Sat Jan 5 12:45:27 2008 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Sat, 05 Jan 2008 15:45:27 -0500 Subject: [gutvol-d] chinese versus english In-Reply-To: References: <1199564044.7678.74.camel@abetarda> Message-ID: <477FEC67.2060708@verizon.net> Andrew Sly wrote: > On Sat, 5 Jan 2008, Júlio Reis wrote: > > >> Speaking of Chinese e-texts... does anyone have a clue as to how these texts are produced? Have Chinese-speakers some web site like DP? Is it the work of people going solo? Or are these already available as e-texts in some other site? >> >> Tintazul. >> > > Interesting question.
In case it helps, when I went back > through the posted list, checking the recent Chinese > texts, I was expecting to see one or two people who > were submitting them. Instead, after checking about 20 > items, I only saw an email address duplicated once. > So it appears these are being submitted by many different > people. Since this is happening all at the same time, > it would be natural to assume that there is _some_ > kind of organization behind it... My understanding, which may be wrong, is that there's a professor (in Taiwan?) who has his students transcribe texts, perhaps as part of a course. They then upload them to PG. That's why the Chinese texts come in clumps, with nothing for a very long time, and then a whole lot. Juliet From julio.reis at tintazul.com.pt Sun Jan 6 13:05:28 2008 From: julio.reis at tintazul.com.pt (Júlio Reis) Date: Sun, 06 Jan 2008 21:05:28 +0000 Subject: [gutvol-d] Chinese In-Reply-To: References: Message-ID: <1199653528.19731.46.camel@abetarda> > > My understanding, which may be wrong, is that there's a professor (in > > Taiwan?) who has his students transcribe texts, perhaps as part of a > > course. They then upload them to PG. That's why the Chinese texts come > > in clumps, with nothing for a very long time, and then a whole lot. So all we need at the Portuguese team is to teach those hard-working guys our language :-P because our 6th place in Gutenberg is as good as gone. Long live the Taiwanese powerhouse, and may many more spring up around the Chinese-speaking world. Júlio. From Bowerbird at aol.com Sun Jan 6 14:38:28 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 6 Jan 2008 17:38:28 EST Subject: [gutvol-d] chinese versus english Message-ID: in the r.s.s. feed today from the posted list, chinese beats english by a score of 3-2, but german pulls a surprise and trumps all with 5. -bowerbird
From schultzk at uni-trier.de Mon Jan 7 01:57:39 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Mon, 7 Jan 2008 10:57:39 +0100 Subject: [gutvol-d] moby dick -- a report on the state of the art of digitization In-Reply-To: References: Message-ID: <6AE375A2-94C6-4314-B555-49ED68796102@uni-trier.de> Hi Bowerbird, I have said it time and time again: you will not get decent ocr or oca without a proper grammar (in the computational-linguistics sense) and parser. Without these, oca will have the problems you mentioned below. As far as your English-American comparison is concerned, you need a decent dictionary/ies and a translation module: 1) colour - color: easy enough to handle 2) lorry - truck: a synonym dictionary could do this as well as point one 3) boot - trunk (as the trunk of a car): this problem requires semantics or co-text analysis. 1 and 2 are easy enough and are cheap to do automatically, except if the author purposely mixes English and American (a very minute percentage, I assume). 3 is a different animal. It is the crux of any translation system. Yet a well-designed system will handle 90-95%. This type of project would take at least a year of man-power to produce, in other words far too expensive. Dashes are hard to differentiate for any automatic system. There are three kinds. Though they have different lengths, how can a system tell them apart? It would have to have intimate knowledge of the point size and font, which oc* systems do not have. Of course one could try to program the intelligence needed, but I assume it still would not work well. Humans are a lot better at this. Hyphenation and dashes can be differentiated with multi-line parsing, yet many programmers consider the effort not worth it.
As time goes by oca will get grammar rules and more intelligence, just like the dictionaries and intelligence improved ocr. regards Keith. On 04.01.2008 at 19:27, Bowerbird at aol.com wrote: > ok, i've told you recently that > o.c.r. from the o.c.a. is good. > > and that it can be improved by > post-o.c.r. clean-up programs. > > and then improved even further > by comparison with an existing > digitization, if one should exist, > to the point it can be _finished_, > quickly, even largely automatically. > > *** > > here's a report in support of that... > > i examined the o.c.a. first volume of "moby dick": > > http://www.archive.org/details/mobydickorwhale01melvuoft > > i did an initial comparison of their o.c.r. > with the e-text from project gutenberg: > > http://www.gutenberg.org/etext/2489 > > it didn't take long to determine that the > p.g. e-text was from a different edition > than the one which the o.c.a. scanned... > > the first tip-off was that the o.c.a. edition > used british spellings, not the american > ones which are there in the p.g. e-text. > > this brings up a good point to consider > when we're talking about _comparison_ > as a strategy for correcting o.c.r. text... > > specifically, there are a number of things > that will cause superfluous "differences" > that need to be ignored in comparisons, > like american/british spelling variants... > you don't want these differences flagged. > > also -- as is typical with british editions -- > there was also a difference in quotemarks; > the british use single-quotemarks as default, > and nested quotes use double-quotemarks. > american editions use double-quotemarks > as the default, of course, with any internal > quotes signified by single-quotemarks...
> > in addition, one of the differences that you > will frequently find between editions involves > punctuation (especially colons and semicolons, > as well as various takes on hyphenated words), > and this is particularly true when one edition is > from a british publisher, the other an american. > > there were other superficial differences > between these two texts. as a quick list: > 1. chapter numbers (roman versus arabic) > 2. headings (all-upper versus mixed-case) > 3. chapter initial capped words (versus not) > 4. block indents (i.e., the o.c.a. text had none) > > finally, one of the biggest complicating factors > on comparing these e-texts is due to a massive > _incompetence_ in the o.c.a. workflow, namely > that they lose all the em-dashes in their text... > > that's right, you heard me correctly. > > they lose all the em-dashes in their o.c.r. text. > > some stupid person somewhere has evidently > mis-set some toggle, discarding em-dashes... > > a glitch this big is ridiculously unforgivable... > my mind is just boggled that they could even > _make_ such a stupid mistake. but they did... > > even worse, i have tried -- tried repeatedly -- > to bring it to their attention. yet it persists... > > this is just plain frustrating. and i've decided > that i will make it one of my missions in 2008 > to get them to fix this glitch. wish me luck, eh? > > in the meantime, though, we've gotta accept it, > and move on with our mission. [snip snip] -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080107/c6622173/attachment.htm From Bowerbird at aol.com Mon Jan 7 10:43:18 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 7 Jan 2008 13:43:18 EST Subject: [gutvol-d] moby dick -- a report on the state of the art of digitization Message-ID: please. if anyone wants to do the same test on volume 1 of moby dick, just to "keep me honest", please feel free... 
maybe someone could put it through distributed proofreaders, so that we can check my work when it emerges from d.p., 1 or 2 or 3 years from now. or, if you want to do it on volume 2, go ahead, as i plan to do that next... (split-half reliability tests, ya know.) *** oh yeah. i shoulda mentioned that sometimes i take the difficult route just to be perverse. if you do not feel like doing that, use moby10b instead of moby11, as moby10b _is_ based upon the same edition the o.c.a. scanned... (or at least i _thought_ that was so, for some reason i can't remember, but i might've been wrong on that.) and if you _do_ stay with moby11, be advised of a major deficiency at: > the scene of the catastrophe _ironic_, because there is a loss of text here -- involving 240+ words -- which makes this "the scene of the catastrophe", quite indeed... *** keith said: > you will not get decent ocr without a proper grammar you seem not to notice that we _are_ getting "decent" o.c.r. in fact, when combined with a good post-o.c.r. cleanup tool, the results become quite _astounding_... > Far as your comparision English-American is concerned > you need a decent dictionary/ies and a translation module: i'm not sure what your point is, but it has no applicability here. the only reason that i got any differences of this type is because i was comparing an american edition with an english edition, and that means i just have to harmonize the two, not do a translation. > Dashes are hard to differentiate for any automatic system. um, again, no applicability here. the dashes were recognized, i'm quite sure, but a human glitch in the workflow drops 'em, most probably because they are "high-bit ascii" characters... > they have different lengths, how can a system tell them apart modern o.c.r. does quite well on this task. > Hyphenation and dashes can be differentiated with multi-line > parsing, yet many programmers consider the effort not worth it. oh please. 
those programmers don't deserve the appellation... > As time goes by oca will get grammar rules and more intelligence, > just like the dictionaries and intelligence improved ocr. we'll have near-perfect recognition of the characters _long_ before then... *** thank you. -bowerbird From Bowerbird at aol.com Mon Jan 7 14:30:28 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 7 Jan 2008 17:30:28 EST Subject: [gutvol-d] why cyberspace is cheaper than meat-space Message-ID: please. *** why cyberspace is cheaper than meat-space... meatspace: artist -> company -> wholesaler -> retailer -> audience cyberspace: artist -> audience *** thank you. -bowerbird From creeva at gmail.com Tue Jan 8 06:04:26 2008 From: creeva at gmail.com (Brent Gueth) Date: Tue, 8 Jan 2008 09:04:26 -0500 Subject: [gutvol-d] why cyberspace is cheaper than meat-space In-Reply-To: References: Message-ID: <2510ddab0801080604k7606da6ha9f339c0eef989f0@mail.gmail.com> I would say you're close.
This would be more accurate: meatspace: artist -> company -> wholesaler -> retailer -> audience cyberspace: artist -> distributor -> audience While the artist may be the distributor themselves, there are web site fees and Internet access fees. These all go up on a curve the more popular the artist is, and to match the normal revenue stream of traditional publishing there will be costs associated with it - either that or they aren't doing their taxes properly ;) We take for granted that publishing on the Internet is "free". It's not: free distribution would normally require you to go the Creative Commons route and use archive.org to distribute your data, or to use a service that advertises to your readers, consumers, watchers, etc., in which case you don't get a cut or cannot adequately control the content. While the cost and barrier to entry have been greatly reduced in other ways, we cannot turn a blind eye and say it doesn't cost anything. Running Gutenberg costs money, and running archive.org costs money; just because the artist may or may not see the cost does not negate the fact that it is there. I agree with you that it is the better method all around - I just wanted to make sure that everything is realized. On Jan 7, 2008 5:30 PM, wrote: > please. > > *** > > why cyberspace is cheaper than meat-space... > > meatspace: artist -> company -> wholesaler -> retailer -> audience > > cyberspace: artist -> audience > > *** > > thank you. > > -bowerbird > > > > ************** > Start the year off right. Easy ways to stay in shape. 
> http://body.aol.com/fitness/winter-exercise?NCID=aolcmp00300000002489 > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > From Bowerbird at aol.com Tue Jan 8 09:57:23 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Jan 2008 12:57:23 EST Subject: [gutvol-d] why cyberspace is cheaper than meat-space Message-ID: brent said: > there is web site fees, Internet Access fees right. and i agree that in some cases, those can be substantial. but i do not agree it means there should be a "distributor" node in between the artist and the audience. those nodes are meant to indicate _entities_that _take_a _cut_ from the revenue stream. web-site and internet access fees are the cost of doing business; indeed, they're pretty much the cost of _being_human_ these days. (yesterday i bought a domain-name for my sisters granddaughter, an hour after she was born; no kid should be without a web-site.) artists of various stripes have always had to incur costs to do art; musicians have to pay money for lessons, and their instruments; artists have to buy brushes and paints and canvas and stretchers; sculptors have to put out a _ton_ of money for their raw material, and tools to work it. digital artists must buy hardware and software. so your internet costs are just another expense on the ledger sheet. i also say it's important to note that there _are_ ways to _minimize_ the costs of your digital distribution. it's not _necessary_, and may even be _undesirable_, for you to bear the full burden of distribution. take advantage of the number-one digital benefit -- i.e., free copies. give fans the explicit right to spread copies of your work far and wide. allow them to put copies on their own web-sites, and let _them_ bear some of the cost. even better, seed copies to peer-to-peer networks, so you don't have to even host an original copy on your own web-site. 
myspace and facebook are now willing to host all kinds of your content (although i encourage you to examine their terms of service carefully.) youtube -- and other sites too -- will host your video for you, for free. there's lots of photo-hosting sites (e.g., photobucket, flickr, shutterfly). the internet archive -- and p.g. too -- are all too happy to host text... also, lots of bloggers are finding that adsense pays their hosting bills. so, while granting your point, i believe it's a relatively small concern... -bowerbird From creeva at gmail.com Tue Jan 8 10:10:36 2008 From: creeva at gmail.com (Brent Gueth) Date: Tue, 8 Jan 2008 13:10:36 -0500 Subject: [gutvol-d] why cyberspace is cheaper than meat-space In-Reply-To: References: Message-ID: <2510ddab0801081010t6d5f713ak68c6b0e963da145e@mail.gmail.com> I completely agree with you and say it's a small concern - it's just a post-production cost of business - compared to the pre-production cost (supplies, lessons, etc.) in most regards it's nominal and can be almost completely free - it's just there and I wanted to throw it out for those who think it's a completely free ride. On Jan 8, 2008 12:57 PM, wrote: > brent said: > > there is web site fees, Internet Access fees > > right. and i agree that in some cases, those can be substantial. > > but i do not agree it means there should be a "distributor" node > in between the artist and the audience. those nodes are meant > to indicate _entities_that _take_a _cut_ from the revenue stream. > > web-site and internet access fees are the cost of doing business; > indeed, they're pretty much the cost of _being_human_ these days.
> > (yesterday i bought a domain-name for my sisters granddaughter, > an hour after she was born; no kid should be without a web-site.) > > artists of various stripes have always had to incur costs to do art; > musicians have to pay money for lessons, and their instruments; > artists have to buy brushes and paints and canvas and stretchers; > sculptors have to put out a _ton_ of money for their raw material, > and tools to work it. digital artists must buy hardware and software. > > so your internet costs are just another expense on the ledger sheet. > > i also say it's important to note that there _are_ ways to _minimize_ > the costs of your digital distribution. it's not _necessary_, and may > even be _undesirable_, for you to bear the full burden of distribution. > take advantage of the number-one digital benefit -- i.e., free copies. > give fans the explicit right to spread copies of your work far and wide. > allow them to put copies on their own web-sites, and let _them_ bear > some of the cost. even better, seed copies to peer-to-peer networks, > so you don't have to even host an original copy on your own web-site. > > myspace and facebook are now willing to host all kinds of your content > (although i encourage you to examine their terms of service carefully.) > youtube -- and other sites too -- will host your video for you, for free. > there's lots of photo-hosting sites (e.g., photobucket, flickr, > shutterfly). > the internet archive -- and p.g. too -- are all too happy to host text... > > also, lots of bloggers are finding that adsense pays their hosting bills. > > so, while granting your point, i believe it's a relatively small concern... > > -bowerbird > > > > > ************** > Start the year off right. Easy ways to stay in shape. 
> http://body.aol.com/fitness/winter-exercise?NCID=aolcmp00300000002489 > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > From hart at pglaf.org Tue Jan 8 15:19:25 2008 From: hart at pglaf.org (Michael Hart) Date: Tue, 8 Jan 2008 15:19:25 -0800 (PST) Subject: [gutvol-d] why cyberspace is cheaper than meat-space In-Reply-To: References: Message-ID: I see all this noise about money. Just where is this money supposed to be going? Or coming from, for that matter? mh From Bowerbird at aol.com Tue Jan 8 15:40:11 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Jan 2008 18:40:11 EST Subject: [gutvol-d] why cyberspace gives twice as much money to artists as meatspace does Message-ID: why cyberspace gives more money to artists than meatspace... meatspace:? artist -$1-> company -$4-> wholesaler -$8-> retailer -$16-> audience cyberspace: artist -$2-> audience (dollar-amounts represent the selling-price per unit, so are paid right-to-left; so out of the $16 price for an average book or music c.d., the artist receives $1.) -bowerbird ************** Start the year off right. Easy ways to stay in shape. http://body.aol.com/fitness/winter-exercise?NCID=aolcmp00300000002489 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080108/344dad1f/attachment.htm From Bowerbird at aol.com Tue Jan 8 15:48:29 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Jan 2008 18:48:29 EST Subject: [gutvol-d] moby dick -- data in support of the report on the state of the art of digitization Message-ID: please. here's those results from my research on comparing cleaned-up o.c.r. results to an existing digitization... 
to remind you, we're doing volume 1 of moby dick:
> http://www.archive.org/details/mobydickorwhale01melvuoft
> http://www.gutenberg.org/etext/2489

***

before beginning, let me make a brief comment on
"frankenstein" texts, assembled from many sources.

one of the p.g. versions of moby dick -- there are
three of them -- evolved through a few iterations:
> This text is a combination of etexts, one from
> the now-defunct ERIS project at Virginia Tech
> and one from Project Gutenberg's archives.
> The proofreaders of this version are indebted to
> The University of Adelaide Library for preserving
> the Virginia Tech version. The resulting etext
> was compared with a public domain hard copy
> version of the text.

it doesn't matter. each line is either right or wrong.
if a line is right, it doesn't matter how it came about.
and if it's wrong, it doesn't matter how it got that way.

i'll have more to say about "frankenstein" texts later,
but for now, the important point is "it doesn't matter".

***

the results of this experiment replicated 2 done before,
confirming a word-by-word examination of each page is
unnecessarily wasteful of human time and energy in the
production of a highly-accurate book digitization...

in this regard, it is perhaps useful to get an overview of
the state-of-the-site over at distributed proofreaders...

in 2007, d.p. posted 2,222 finished e-texts to p.g. there are
maybe 2-4 times that many books in their system at the
current time, being processed or in queues waiting...
it is for this reason that most books now can be expected
to take anywhere from 2-4 years to traverse the system...

most of the thousands of people who volunteer on the site
probably don't care much how long a book takes to digitize,
but _some_ do, and are becoming _increasingly_ unhappy
about the long time-period it takes to produce most books.

i've remarked before that i think d.p.'s workflow is grossly
inefficient, and that it wastes far too much time and energy
of the human beings who are volunteering their services...

i won't elaborate on it here, again, but i really do think that
it needs to be kept in mind when considering these results...

***

to do a comparison involves first a "shaping" of the files...

not to give away all my secrets here, but the initial step of
this "shaping" is to mold files with matching paragraphs...

for the most part, paragraphing in the p.g. e-text was clear,
but the o.c.a. text required quite a bit of work in this regard.
(e.g., runheads needed to be removed, paragraphs fixed...)

the following step was to rewrap the lines of the p.g. e-text
according to the linebreaks from the o.c.a. file. as i've said,
this involved quite a lot of work as well, most of it because
i am writing and perfecting a program to perform this task.

the next step was to track down the _irrelevant_ differences,
and provide a means of controlling for them. this wrinkle is
why an off-the-shelf tool like "wdiff" won't work for this job.

you might remember that a while back, i invited carlo to give
a workshop in using wdiff to compare different digitizations.
he never came through with it, but he _has_ recently posted
the output from one of his efforts to do such a comparison:
> http://posso.dm.unipi.it/~traverso/Restricted/sh-wd3.txt

i encourage you to take a good hard look at his output, and
evaluate its usefulness compared to my output shown here...

the last step in the process is to find and correct differences.
(i'll probably elaborate on this last step in some future posts.)

now that i've given you a summary of the comparison process,
let's examine each of those steps a little bit more closely, ok?

***

you will recall that -- because of some stupid human being
who set a toggle incorrectly over at the o.c.a. -- their text is
_missing_ all of its em-dashes. therefore, i decided to delete
the dashes from the p.g. e-text, to avoid spurious differences.

likewise, because the p.g. e-text used american double-quotes
for conversation, while the o.c.a. text used british single-quotes,
i changed all double-quotes to single-quotes, so they'd match...

finally, i believe i standardized some punctuation differences too.

keep all these changes in mind when you examine my results...

***

ok, finally, here is some real, honest-to-goodness data, at last!

in my last post, i said:
> i'll show you the files that ended up being compared.

the "shaped" o.c.a. text is located here:
> http://z-m-l.com/misc/mobyv1-oca-worked.txt

and the "shaped" p.g. e-text is here:
> http://z-m-l.com/misc/moby11-all-worked.txt

***

in my last post, i said:
> i'll show you the global changes that were made by
> my post-o.c.r. clean-up application-tool.

i've appended to this post the changes that i made to the file.

for the most part, these clean-up routines will be familiar to
anyone experienced with the text typically returned by o.c.r.
(i didn't do anything distributed proofreaders couldn't do...)

for instance, i treat punctuation and characters "gone wild",
as listed, and searched for some high-probability scannos.
(the 3 listed were the only ones which occurred in this file;
ironically, in this book about a whale, there were 2 cases of
an _actual_ "arid" in this book, and 2 where it was a scanno.)

after the general class of _garbage_clean-up_, the next class
was concerned with the changes made by the editing process
in the creation of the _later_edition_ used for the p.g. e-text...

most of the cases in this latter class were _british_ spellings;
i have included those, so you can see what actually occurred.
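
the program that did this is not shown in this post, so what follows is only a rough python sketch of the three controls described above: dropping dashes, folding double-quotes into single-quotes, and making known spelling pairs "drop out". everything named here (`normalize`, `SPELLING_PAIRS`, and so on) is hypothetical, the table holds just four pairs borrowed from the appended list, and it is an assumption that dashes appear as "--" or as a unicode em-dash. running the same function over both files means it doesn't matter which file uses which spelling.

```python
import re

# hypothetical sample of spelling pairs, borrowed from a few entries in the
# list appended to this post; a real table would carry the whole list.
SPELLING_PAIRS = {
    "grey": "gray",
    "phrensy": "frenzy",
    "dropt": "dropped",
    "ay": "aye",
}


def drop_dashes(line):
    """Replace each dash (assumed to be '--' or a unicode em-dash) with one space."""
    return re.sub(r"\s*(?:-{2,}|\u2014)\s*", " ", line)


def fold_quotes(line):
    """Turn every double-quote into a single-quote so the two styles match."""
    return line.replace('"', "'")


def fold_spellings(line, pairs=SPELLING_PAIRS):
    """Rewrite whole words found in the table, keeping an initial capital."""
    def swap(match):
        word = match.group(0)
        target = pairs.get(word.lower())
        if target is None:
            return word
        return target.capitalize() if word[0].isupper() else target
    return re.sub(r"[A-Za-z]+", swap, line)


def normalize(line):
    """Apply all three controls; meant to be run on both files before comparing."""
    return fold_spellings(fold_quotes(drop_dashes(line))).strip()


if __name__ == "__main__":
    print(normalize('"Ay, ay," said he--the grey one.'))
```

because the spelling table is matched on whole words only, a word like "size" or a hyphenated word passes through untouched; that is the reason for using an explicit table here rather than a blanket pattern.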
there were also instances that might be british variants,
or might just be _editorial_decisions_, i'm not sure, including:
> *wards // *ward
> ay // aye
> phrensy // frenzy

in general, british variants use "*ise", "*ising", and "*isation"
-- e.g., characterise, characterising, and characterisation --
whereas american variants use "*ize", "*izing", and "*ization"
-- e.g., characterize, characterizing, and characterization...

you can't do a global change because some british variants
do contain the "ize" or "izing". i listed ones occurring here:
> baptize, capsize, denizen, mizen*, seize, seizing, size

but as you can see, there are quite a few different terms,
and a number of them were of a relatively high frequency,
so controlling for them was a necessary part of this task...

the process of getting these term-pairs to simply "drop out"
of the files is a fairly complicated one, but i believe that i am
starting to get a relatively good handle on that programming.

***

in my last post, i said:
> i'll show you the lines that differed between the files.

there were 444 lines where the files exhibited differences:

it's important to remember that not all these differences
were errors in the o.c.r., and i elaborate on that fact next.

***

in my last post, i said:
> i'll show you how i categorized these different lines.
> (some were edits, some were errors in the p.g. e-text,
> and some were punctuation differences; the rest were
> the lines the o.c.r. _probably_ recognized incorrectly.)

based on just a cursory look at each pair of different lines,
i made a rough split of them into the following categories...

one category was 96 _edits_ made in the later (p.g.) e-text:
> http://z-m-l.com/misc/moby-out11-goodedit-96.html
your review of these pairs will demonstrate what i mean.

the next category was 48 _likely_errors_ in the p.g. e-text:
> http://z-m-l.com/misc/moby-out11-badedit-48.html
again, your brief review should give an idea what i mean.
another obvious category was 64 _punctuation_differences_:
> http://z-m-l.com/misc/moby-out11-punct-64.html
some of these might be o.c.r. errors, but for the most part,
it seemed to me these were intentionally-produced edits...

the next category was a special one -- 14 "stealth scannos":
> http://z-m-l.com/misc/moby-out11-stealth-14.html
i pulled these for special consideration, discussed below.

finally, the category we are interested in, 222 o.c.r. errors:
> http://z-m-l.com/misc/moby-out-scanno-222.html
this category's 222 pairs were _half_ of the original 444...

***

for those of you who didn't actually go and _look_ at those,
here's a few samples showing how i present the differences.

view these using a monospaced font for the superior results;
the third line underneath each pair helps you quickly perceive
where the lines differ; learn to use it, as it's extremely handy...
(i've tested presentation of differences; this is the _best_ way.)

> Vhe bulwarks of ships from China; some high aloft in
> the bulwarks glasses! of ships from China; some high aloft in
> x============xxx=xxxx=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

> professor. Yes as everyone knows meditation andli
> professor. Yes as every one knows meditation and
> =======================xxxxxxxxxxxxxxxxxxxxxxxxxx

> you yourself feel such a mystical vibration when first;
> you yourself feel such a mystical vibration when first
> ======================================================x

> of the great whale himself. Such a gortentous and
> of the great whale himself. Such a portentous and
> ===================================x=============

> With anxious grapnelsJE had sounded my pocket and only
> With anxious grapnels I had sounded my pocket and only
> =====================xx===============================

> less service the soles of mv boots were in a most miserable
> less service the soles of my boots were in a most miserable
> ===========================x===============================

> hear the sounds of the tinkling glasses within. But go i
> hear the sounds of the tinkling glasses within. But go
> ======================================================cx

> on Ishmael said I at last; dont you hear? get away l
> on Ishmael said I at last; dont you hear? get away
> ==================================================cx

> all but deserted. But presently I carne to a smoky
> all but deserted. But presently I came to a smoky
> ====================================xxxxxxxxxxxxxx

> city Gomorrah? But The Cfossed Harpoons and
> city Gomorrah? But The Crossed Harpoons and
> ========================x==================

> it from that Cashless window where the frost is on both
> it from that sashless window where the frost is on both
> =============x=========================================

***

ok, those 11 difference-pairs constitute the probable scannos
found in the first two chapters (11 pages) of our book here...

so let's take them individually, and look at the page-scans,
to see if we can tell exactly what might've caused the error.

> Vhe bulwarks of ships from China; some high aloft in
> the bulwarks glasses! of ships from China; some high aloft in
> x============xxx=xxxx=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

the capital-v is a scanno. the "glasses!" is an _edit_
-- i probably called it a bad edit -- in the p.g. e-text.

> professor. Yes as everyone knows meditation andli
> professor. Yes as every one knows meditation and
> =======================xxxxxxxxxxxxxxxxxxxxxxxxxx

the "li" was caused by pencil-marks in the page's margin.
we can't really call that an o.c.r. "error" -- better cropping
would have ensured that stuff like that would not happen --
but it still has to be cleaned up...

> you yourself feel such a mystical vibration when first;
> you yourself feel such a mystical vibration when first
> ======================================================x

again, pencil-marks in the margin...

> of the great whale himself. Such a gortentous and
> of the great whale himself. Such a portentous and
> ===================================x=============

pencil-mark underlining the word could have caused this...

> With anxious grapnelsJE had sounded my pocket and only
> With anxious grapnels I had sounded my pocket and only
> =====================xx===============================

again, pencil-mark underlining probably caused this...

> less service the soles of mv boots were in a most miserable
> less service the soles of my boots were in a most miserable
> ===========================x===============================

straightforward o.c.r. error...

> hear the sounds of the tinkling glasses within. But go i
> hear the sounds of the tinkling glasses within. But go
> ======================================================cx

pencil-mark in the margin again...

> on Ishmael said I at last; dont you hear? get away l
> on Ishmael said I at last; dont you hear? get away
> ==================================================cx

the same pencil-mark from above caused this glitch too.

> all but deserted. But presently I carne to a smoky
> all but deserted. But presently I came to a smoky
> ====================================xxxxxxxxxxxxxx

straightforward o.c.r. error...

> city Gomorrah? But The Cfossed Harpoons and
> city Gomorrah? But The Crossed Harpoons and
> ========================x==================

a stray printer's mark on the page caused this one...

> it from that Cashless window where the frost is on both
> it from that sashless window where the frost is on both
> =============x=========================================

another pencil-mark -- circling this word -- caused this...

***

summing it all up, then, we've got 7 pencil-marks as causes,
and 1 bad edit (in the p.g. e-text), and 1 printer's mark snafu,
and 3 scannos. all in all, i'm impressed with the o.c.r. quality...

so, yeah, this book had a lot of pencil-marks in the margin.
and, quite honestly, when you consider it's been in a library
for maybe 50 or 100 years, that's not all that surprising, is it?

but it does show how ridiculous it would be to _honor_ these
pages with an extremely-high scanning resolution, which is
what some obsessively-compulsive people want us to do, as if
they constituted some pristine "idealized" version of this book.

and, to repeat, we certainly cannot blame our o.c.r. programs
when we feed them pages that we have not cropped properly.
if we don't want it to attend to stuff in the margins, then we
should draw a bounding-box around the text-block for it...
(not much we can do about marks _within_ the text-block, but
then again, maybe some image-wizards could fix that up too.)

the point is, our o.c.r. programs are doing some _great_ work...
(there were lots and lots of marks that _didn't_ foul up the o.c.r.)

in conclusion, if we did our jobs as well as the o.c.r. does its job,
the text that came out of our workflow would be nearly perfect...

***

in my last post, i said:
> if i've gotten around to it, i'll let you know the results
> of my check of these differences against the scans...

i didn't bother to do this yet. i'm thinking that i will use
moby10b.txt instead, as it's based on the same edition of
the book as the one which was scanned by the o.c.a.
(at least i had some reason to think that might be true,
even though i cannot remember what that reason was.)
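
going back to the three-line displays above: the marker line under each pair looks like a plain position-by-position comparison, with "=" where the two lines carry the same character and "x" where they don't. here is a minimal python sketch under that assumption. `marker_line` is a hypothetical name, not the routine actually used for this post, and the sketch makes no attempt to reproduce the "c" that shows up in two of the markers, since the post doesn't say what it stands for.

```python
def marker_line(ocr_line, ref_line):
    """Build a '='/'x' line marking, position by position, where two lines differ."""
    width = max(len(ocr_line), len(ref_line))
    # pad the shorter line with a character that cannot occur in the text,
    # so that any overhang on the longer line is marked as a difference
    ocr_padded = ocr_line.ljust(width, "\0")
    ref_padded = ref_line.ljust(width, "\0")
    return "".join(
        "=" if a == b else "x" for a, b in zip(ocr_padded, ref_padded)
    )


if __name__ == "__main__":
    ocr = "less service the soles of mv boots were in a most miserable"
    ref = "less service the soles of my boots were in a most miserable"
    print(ocr)
    print(ref)
    print(marker_line(ocr, ref))
```

a purely positional comparison like this has one visible side effect: once one line gains or loses a character, everything to its right is marked as different, which is consistent with the "everyone" versus "every one" pair shown earlier.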
at any rate, i think you can tell by examining the 222 lines
where differences occurred that some had scanning errors
in them, while others might not have. good enough for me.
even if all 222 had scanning errors, that's just 2% of the file.

***

in my last post, i said:
> finally, i'll show you the lines that _might_ have had
> "stealth scannos" on them -- lines which _might_ be
> a _problem_ with the comparison method of proofing,
> were they found to be a relatively common occurrence.

a "stealth scanno" is an o.c.r. error that will _not_ show up
in a regular spell-check, because the "wrong" word is "valid",
in the sense that it is the _correct_ spelling for another word.

so if "stage" is incorrectly recognized as "state", it won't be
flagged by a spell-checker, because "state" is also a word...
(whereas, for instance, "spage" would be flagged as wrong.)

"stealth" scannos are of concern to the comparison method
-- especially when we compare two sets of o.c.r. results --
because they might occur in both of the digitizations and
-- since an identical error would lead to the lines "matching" --
we wouldn't look at that line, so we would _miss_the_error_.

therefore, in these experiments of mine, i'm super-sensitive
to any stealth-type scannos that might be uncovered, to see
if their frequency would constitute a troubling significance...

in the first 2 experiments, i was staggered to discover that
i found _not_a_single_trace_ of troublesome stealth scannos.
i was prepared to accept a _small_ number of stealth scannos,
since they wouldn't disturb the cost-benefit ratio _that_ much,
but i found _none_, even with some rather extensive checking.

in _this_ test, i uncovered instances that _might_have_been_
stealth scannos. so, of course, i checked them very carefully.

this experiment was a great way to uncover stealth scannos,
since the p.g. text had gone through several human proofers,
and thus presumably could be thought free of stealth scannos.
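
to make the idea concrete, here is a rough python sketch of how candidate stealth scannos might be pulled out of a pair of differing lines: keep the word-pairs where the two files disagree and yet _both_ words are valid. the function name and the tiny `lexicon` set are hypothetical, a real run would need a proper word-list, and the sketch simply assumes the two lines split into the same number of words.

```python
import string


def stealth_candidates(ocr_line, ref_line, lexicon):
    """Return (ocr_word, ref_word) pairs that differ although both are real words."""
    ocr_words = ocr_line.split()
    ref_words = ref_line.split()
    if len(ocr_words) != len(ref_words):
        # the lines do not align word for word; leave them for a human to check
        return []
    found = []
    for ocr_word, ref_word in zip(ocr_words, ref_words):
        a = ocr_word.strip(string.punctuation).lower()
        b = ref_word.strip(string.punctuation).lower()
        if a != b and a in lexicon and b in lexicon:
            found.append((a, b))
    return found


if __name__ == "__main__":
    # hypothetical stand-in for a real dictionary
    lexicon = {"the", "stage", "state", "was", "set"}
    print(stealth_candidates("the state was set.", "the stage was set.", lexicon))
    print(stealth_candidates("the spage was set.", "the stage was set.", lexicon))
```

a flagged pair is only a candidate: it still takes a look at the scan to tell whether the o.c.r. or the other text is the one that is wrong.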
(it's too bad we don't have human-proofed text for _all_ books!)

again, i was surprised by the infrequency of stealth scannos...

i found only _two_ in this entire 600k of text:

> ----> match // watch (p. 214) stealth (italics)
> (Foresail rises and discovers the match standing lounging
> (Foresail rises and discovers the watch standing lounging
> ==================================x======================

> ----> shook // shock (p. 259) error (footnote)
> and thereby combining the speed of the two objects for the shook; to
> and thereby combining the speed of the two objects for the shock; to
> ==============================================================x=====

it's worth noting that one of these two stealth scannos occurred on an
_italicized_ word, and the other was in a (small-font) footnote, where
the "c" does look very much like an "o", even to a human eye.

in a dozen other cases that _might've_ constituted a stealth scanno,
a review of the scan itself showed the o.c.r. had recognized correctly.
(the difference, accidental or intentional, was in the p.g. e-text.)

those dozen other cases are listed here:

> ----> state // stage (p. 34) o.c.r. was correct
> tion state neither caterpillar nor butterfly. He was
> tion stage neither caterpillar nor butterfly. He was
> ========x===========================================

> ----> distinct // distant (p. 48) o.c.r. was correct
> face shed a distinct spot of radiance upon the ships tossed
> face shed a distant spot of radiance upon the ships tossed
> ================xxx========================================

> ----> whaleman // whalemen (p. 137) o.c.r. was correct
> the whaleman who first broke through the jealous policy
> the whalemen who first broke through the jealous policy
> ==========x============================================

> ----> liberally // literally (p. 149) o.c.r. was correct
> these cases the native American liberally provides the
> these cases the native American literally provides the
> ==================================x====================

> ----> odd // old (p. 157) o.c.r. was correct
> tanrail to mainmast Stubb the odd second mate came
> taffrail to mainmast Stubb the old second mate came
> ===xx=========================xxxx=================

> ----> place // space (p. 166) o.c.r. was correct
> the various species or in this place at least to much of
> the various species or in this space at least to much of
> ===============================xx=======================

> ----> those // these (p. 167) o.c.r. was correct
> given you those items. But in brief they are those:
> given you those items. But in brief they are these:
> ===============================================x===

> ----> ever // even (p. 192) o.c.r. was correct
> and ever when most obscured by that London smoke
> and even when most obscured by that London smoke
> =======x========================================

> ----> had // has (p. 223) o.c.r. was correct
> assailants had completely escaped them; to some minds
> assailants has completely escaped them; to some minds
> =============x=======================================

> ----> not // nor (p. 224) o.c.r. was correct
> influences at work. Not even at the present day has the
> influences at work. Nor even at the present day has the
> ======================x================================

> ----> his // its (p. 281) o.c.r. was correct
> his boats bow with his tail these allusions of his were at
> his boats bow with its tail these allusions of his were at
> ===================xx=====================================

> ----> whole // while (p. 322) o.c.r. was correct
> mind to flog them all round thought upon the whole
> mind to flog them all round thought upon the while
> ===============================================x==

the results in general support the notion that the o.c.r. from the o.c.a.
is _extremely_good_, but these particular results specifically verify it...
a near-total absence of stealth scannos is a testament to high quality,
and the o.c.a. o.c.r. has demonstrated it in every test of my research...
i don't know exactly what o.c.r. app they're using, but it's _good_, folks.

if only o.c.a. fixed their glitches, so we could actually _use_ their text...

***

so, at the top of this post, i said "frankenstein" e-texts do
have their place with this comparison methodology.

at the same time, however, they have little enduring use.

because we now have a scan-set that has a lifetime which
is probably as rock-solid as we can imagine for _any_ file,
being backed by one of the world's biggest companies...
nothing is certain, of course, but the continued presence
of this scan-set at google is _relatively_ assured to people.

given that, my digital-text companion-file also probably
has a good chance of being mirrored into the far future...

since i will be linking my text-file up to the actual scans,
so the text can be verified in a very convenient manner,
the _accuracy_ of my digitization can be checked easily...

the project gutenberg e-texts, however, will have utility
that is relatively small now, since they cannot reliably be
verified by a particular scan-set, even a shortlived one...

it's not that i think _trustworthiness_ is all that essential;
i have argued that most people don't care about it much.
and indeed, as i've shown, i'm not reluctant to _do_edits_.

but if one text-file offers _verifiable_ "trustworthiness",
while another version simply says "trust me", it's certain
which of the two will be the one people come to prefer...

and yes, i'm aware that the p.g. plan is to post scan-sets,
and i know that many are in the process of being posted;
but with the lines rewrapped, any verification process
will always be a clumsy one.

make of that whatever you want. but if i were you,
i'd stop rewrapping the lines right now...
any rewrapped text that can't be verified will be trashed...

***

repeating the main results, in this book of 11,411 lines,
only 2% of them -- 222 -- were different across the files.

moreover, as my output shows, it's not even the case
that we need to proof those _entire_lines_, because a
simple routine clearly shows exactly where the difference
occurs, so it's often a matter of focusing on _a_single_character_.

and resolving our 222 differences, by viewing the scans,
leads to a digitization that's undoubtedly highly accurate.

i won't claim perfection, because glitches always happen,
but given the minimal amount of human time and energy
required by this comparison methodology relative to the
other methodologies commonly used, i _strongly_ believe
the cost-benefit ratio of my methodology is far superior...

this text is clean enough for "continuous proofreading"
-- a smooth-reading by people interested in its content.
my oft-repeated standard for that is 1-error-in-10-pages,
and there's no uncertainty at all that we reached that level...

***

it's obvious that we can compare digitizations much faster
than the proof-every-word-on-every-page methodology.

the results show that the combination of good o.c.a. o.c.r.
and an aggressive post-o.c.r. clean-up program can create
text that is phenomenally accurate, with little human help.

considering the millions of scan-sets we need to digitize,
this is good news indeed.

these results confirm those i've obtained on 2 books earlier:
> http://www.pgdp.net/phpBB2/viewtopic.php?t=24008

at some point, d.p. will no longer be able to _ignore_ this...

even now, i believe they are on questionable moral ground,
considering the time and energy that their workflow wastes,
human resources that might well have been volunteered with
a reasonable expectation that they were being utilized wisely.
would you give to the red cross if you knew they wasted money?

at any rate, i welcome any feedback on this research experiment.

thank you.
-bowerbird p.s. appended are global changes made for the comparison, which took two forms: cleaning up the o.c.r., and controlling for the edits. > -------------------------------------------------------------- > -------------------------------------------------------------- > -------------------------------------------------------------- > global changes made to clean up the o.c.r. garbage > -------------------------------------------------------------- > -------------------------------------------------------------- > -------------------------------------------------------------- > > *****> characters gone wild > > * [ ] { } | / \ < > _ @ # $ % ^ & > > -------------------------------------------------------------- > > *****> contractions gone wild > > 's > he?s > he? s > J s to 's > I 've > I Ve > ye 've > we 've > Ve > > -------------------------------------------------------------- > > *****> improper (or unlikely) punctuation settings > > space-period > period-space-lowercase > comma-space-uppercase (not a name) > > -------------------------------------------------------------- > > *****> improper (or unlikely) double-punctuation > > ,; > :: > > -------------------------------------------------------------- > > *****> common stealth scannos > > arid = and (whole word) > lie = he (whole word) > hi = in (whole word) > > -------------------------------------------------------------- > > *****> one-letter words that are not "o" or "a" or capital "i" > > c // ' > hear the sounds of the tinkling glasses within. But go i > on, Ishmael, said I at last; don't you hear? get away l > 'Landlord,' I whispered, w that ain't the harpooneer, > > -------------------------------------------------------------- > > *****> improper (or unlikely) line-starting punctuation > > . semble; in some sort, did still. But that thing of his > )xtras with startling accounts of commonplaces never > ! life is gulped and gone. Steward, refill! 
> > -------------------------------------------------------------- > > *****> improper (or unlikely) line-ending punctuation > > thousand boat lowerings ere the White Whale had torn( > > -------------------------------------------------------------- > -------------------------------------------------------------- > -------------------------------------------------------------- > global changes made as controls for the edits > -------------------------------------------------------------- > -------------------------------------------------------------- > -------------------------------------------------------------- > > *****> spellings changed between editions, including british variants > > &c // etc. > afterwards > agonising > armour > around // round > ay // aye > bedwards > behaviour > Cooke // Cook > caster // castor > characterising > civilise > clamour > colour > connection // connexion > considerating // considering > Duodecimoes // Duodecimos > dropt // dropped > enclosed // inclosed > endeavour > envelops // envelopes > favour > favourite > flavour > generalising > grey // gray > Hallo // Halloa > Hollo // Halloa > Holloa // Halloa > hindoo // hindu > homewards > honour > honourable > humour > idealised > idolater // idolator > individualising > insure // ensure > inwards > jeopardise > labour > licence > monopolising > neighbour > organise > Pottsfich // Pottsfisch > parlour > patronising > phrensy // frenzy > phrensies // frenzies > popularise > pulverise > realise > recognise > reverie // revery > rumour > savor > scrutinising > sermonising > soliloquise > southwards > specialities // specialties > spiralise > succour > symbolise > symbolisings > systematised // systemised > systemised > tantalising > tranquillise > uncivilise > upwards > valour > vapour > vigour > villan // villain > villanous // villainous > yea // yes > _Pequod_ > _Pequod_'s > > -------------------------------------------------------------- > > *****> british words that _do_ have 
"ize" or "izi" > > baptize > capsize > denizen > mizen > mizentop > seize > seizing > size > > -------------------------------------------------------------- ************** Start the year off right. Easy ways to stay in shape. http://body.aol.com/fitness/winter-exercise?NCID=aolcmp00300000002489 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080108/260f196d/attachment-0001.htm From creeva at gmail.com Tue Jan 8 17:52:02 2008 From: creeva at gmail.com (Brent Gueth) Date: Tue, 8 Jan 2008 20:52:02 -0500 Subject: [gutvol-d] why cyberspace is cheaper than meat-space In-Reply-To: References: Message-ID: <2510ddab0801081752i4a01231evc37216380bcb0539@mail.gmail.com> I think he was talking in the abstract On Jan 8, 2008 6:19 PM, Michael Hart wrote: > > I see all this noise about money. > > Just where is this money supposed to be going? > > Or coming from, for that matter? > > > > mh > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From joyce.b.wilson at sbcglobal.net Wed Jan 9 10:46:20 2008 From: joyce.b.wilson at sbcglobal.net (Joyce Wilson) Date: Wed, 09 Jan 2008 12:46:20 -0600 Subject: [gutvol-d] Duplicate texts? Message-ID: <4785167C.1040101@sbcglobal.net> I ran across a couple of pairs of seemingly identical Chinese texts: http://www.gutenberg.org/etext/24120 and http://www.gutenberg.org/etext/24140 Also http://www.gutenberg.org/etext/24112 and http://www.gutenberg.org/etext/24155 Apologies if this is the wrong place to post about them. Joyce W From hart at pglaf.org Wed Jan 9 11:29:50 2008 From: hart at pglaf.org (Michael Hart) Date: Wed, 9 Jan 2008 11:29:50 -0800 (PST) Subject: [gutvol-d] Duplicate texts? In-Reply-To: <4785167C.1040101@sbcglobal.net> References: <4785167C.1040101@sbcglobal.net> Message-ID: Forwarded your noe to our CEO and Prof. Mao. Thanks!!! Michael S. 
Hart Founder Project Gutenberg Recommended Books: Dandelion Wine, by Ray Bradbury: For The Right Brain Atlas Shrugged, by Ayn Rand: For The Left Brain [or both] Diamond Age, by Neal Stephenson: To Understand The Internet The Phantom Tollbooth, by Norton Juster: Lesson of Life. . . On Wed, 9 Jan 2008, Joyce Wilson wrote: > I ran across a couple of pairs of seemingly identical Chinese texts: > > http://www.gutenberg.org/etext/24120 > and > http://www.gutenberg.org/etext/24140 > > Also > > http://www.gutenberg.org/etext/24112 > and > http://www.gutenberg.org/etext/24155 > > Apologies if this is the wrong place to post about them. > Joyce W From Bowerbird at aol.com Wed Jan 9 13:54:19 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Jan 2008 16:54:19 EST Subject: [gutvol-d] an insistence on doing things the hard way Message-ID: please. what do you do with people who insist on doing things the hard way? i dunno. d.p. is persisting in their attempts to find a way of measuring the "confidence" that a certain page has been proofed "enough" times. now they put out a call for "statisticians and data analysis gurus" to help them solve this problem, which they believe must be done before they can implement a "roundless" system. how stupid. how silly. the "secret" is very simple, and i've given it to them repeatedly... they're asking the wrong question. you know that a page is "done" when a certain number of people -- pick a number, any number -- have looked at the page and can no longer find anything to correct. you don't need any fancy statistics. you don't need _any_ statistics. you just need to see whether any changes were made to the page... 
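The stopping rule described in this message (a page is "done" once some chosen number of proofers in a row change nothing) could be sketched roughly as follows. This is only an illustration: the function name, the list of page snapshots, and the `required_clean_passes` parameter are assumptions made for the example, not part of any d.p. tool.

```python
def page_is_done(snapshots, required_clean_passes=2):
    """Return True once the most recent passes changed nothing.

    `snapshots` is a hypothetical list of page texts: snapshots[0] is the
    text before any proofing, snapshots[i] is the text after pass i.
    """
    clean_streak = 0
    # Compare each pass with the one before it; only equality matters,
    # not what the change actually was.
    for before, after in zip(snapshots, snapshots[1:]):
        if before == after:
            clean_streak += 1
        else:
            clean_streak = 0  # a change means verification starts over
    return clean_streak >= required_clean_passes


history = ["teh whale", "the whale", "the whale", "the whale"]
print(page_is_done(history))      # True: two unchanged passes in a row
print(page_is_done(history[:3]))  # False: only one unchanged pass so far
```

Resetting the streak on any change reflects the policy that every change must itself be reviewed; the threshold of two clean passes is an arbitrary choice for the example.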
and you don't even need to know _what_ was changed, you only need to know _whether_ any change was made, so you need nothing more than a simple equivalence test: > if before-text = after-text then no changes were made. _any_ time a page is changed -- _any_ change, on _any_ page -- that change should be reviewed to make sure that it was correct. if you don't have that as a solid policy which can never be violated, you're going to have errors slipping through. it's purely inevitable. but if you _do_ have that as a solid policy, you need no other policy. the only question remaining is how many times it must be verified... i told them this years ago. and i am telling them _again_ right now. and i'll probably have to repeat it still another time, years from now. what do you do with people who insist on doing things the hard way? i dunno. maybe feel sorry for them because they're stupid? or mock them because they're stupid? i dunno. you tell me. thank you. -bowerbird p.s. and yes, this is _doubly_ stupid in light of the research i just posted showing that -- after a good post-o.c.r. clean-up program -- most pages won't have _any_ errors in them, and the ones that do will get a laser-focus. p.p.s. and the doubling cube of stupidity does another flip when they add in the "confidence score" they want to assign to each and every _proofer_ as well. i won't even bother to deal with that asinine nonsense... From schultzk at uni-trier.de Thu Jan 10 02:56:26 2008 From: schultzk at uni-trier.de (Schultz Keith J.) 
Date: Thu, 10 Jan 2008 11:56:26 +0100 Subject: [gutvol-d] an insistence on doing things the hard way In-Reply-To: References: Message-ID: <1CD6168E-A541-46E1-A9A5-936EFDDF25B5@uni-trier.de> Hi Bowerbird, I will disagree with you here. You are right that it could work out, but quality could suffer. A lot of coulds. The DP approach could prove to be better, and it does have its very good caveats. It all depends on the implementation and the design of their parameters, none of which I can say anything about. Yet, as we both know DP and the way they handle things, they will not get it right. They do not need "statisticians and data analysis gurus", but a good linguist, who will be well acquainted with proofing and know more than enough about data analysis and statistics. Even better would be a computational linguist, like me, but I do not like the DP way of doing things. Before the flames come in about my stance on DP, let me say that DP is doing a great job of producing texts. regards Keith. On 09.01.2008 at 22:54, Bowerbird at aol.com wrote: > please. > > what do you do with people who insist on doing things the hard way? > > i dunno. > > d.p. is persisting in their attempts to find a way of measuring the > "confidence" that a certain page has been proofed "enough" times. > > now they put out a call for "statisticians and data analysis gurus" > to help them solve this problem, which they believe must be done > before they can implement a "roundless" system. > > how stupid. how silly. > > the "secret" is very simple, and i've given it to them repeatedly... > > they're asking the wrong question. you know that a page is "done" > when a certain number of people -- pick a number, any number -- > have looked at the page and can no longer find anything to correct. > > you don't need any fancy statistics. you don't need _any_ statistics. > you just need to see whether any changes were made to the page... 
> > and you don't even need to know _what_ was changed, > you only need to know _whether_ any change was made, > so you need nothing more than a simple equivalence test: > > if before-text = after-text then no changes were made. > > _any_ time a page is changed -- _any_ change, on _any_ page -- > that change should be reviewed to make sure that it was correct. > > if you don't have that as a solid policy which can never be violated, > you're going to have errors slipping through. it's purely inevitable. > > but if you _do_ have that as a solid policy, you need no other policy. > the only question remaining is how many times it must be verified... > > i told them this years ago. and i am telling them _again_ right now. > and i'll probably have to repeat it still another time, years from now. > > what do you do with people who insist on doing things the hard way? > > i dunno. maybe feel sorry for them because they're stupid? > or mock them because they're stupid? i dunno. you tell me. > > thank you. > > -bowerbird > > p.s. and yes, this is _doubly_ stupid in light of the research i just posted > showing that -- after a good post-o.c.r. clean-up program -- most pages > won't have _any_ errors in them, and the ones that do will get a laser-focus. > > p.p.s. and the doubling cube of stupidity does another flip when they add in > the "confidence score" they want to assign to each and every _proofer_ as well. > i won't even bother to deal with that asinine nonsense... 
From Bowerbird at aol.com Thu Jan 10 10:43:45 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Jan 2008 13:43:45 EST Subject: [gutvol-d] chinese versus english Message-ID: after several days where english swamped chinese -- that d.p. is an awesome digitizing machine -- chinese makes a surge back today to take it 11-4. -bowerbird From Bowerbird at aol.com Fri Jan 11 13:33:34 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Jan 2008 16:33:34 EST Subject: [gutvol-d] state of the art Message-ID: please. next week, i'll conclude my "state of the art" report... the wrap-up will include topics such as the front-matter, a g.u.i. for making corrections, and conversion to z.m.l., including the preparation for "continuous proofreading". if anyone has any questions or comments, you can post them here or send them to me backchannel... the conclusion, however, should already be very clear. this comparison methodology works extremely well, and it is an order of magnitude more _efficient_ than the old system of proofing every word on every page. this has been demonstrated in 3 separate tests now, with stark and striking results obtained in each one... i will do one or two additional replications, and then turn my focus to perfecting tools to make it happen... have a nice weekend. thank you. -bowerbird 
From ricardofdiogo at gmail.com Sun Jan 13 18:44:57 2008 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Mon, 14 Jan 2008 02:44:57 +0000 Subject: [gutvol-d] #23961 copyrighted remove from catalog Message-ID: <9c6138c50801131844r378d0efeh58c14855e902b93f@mail.gmail.com> Etext #23961 (Manifesto Anti-Dantas) is not in the public domain in the USA under the pre-1923 rule. Unless the editor gave his permission, it must be removed from the catalog. 1916 is _NOT_ the publication date. It's part of the text. In cases where the publishing date is not prominent in Portuguese Title Pages, you are very welcome to ask for my help before giving the clearance line. Ricardo From Catenacci at Ieee.Org Mon Jan 14 05:44:47 2008 From: Catenacci at Ieee.Org (Onorio Catenacci) Date: Mon, 14 Jan 2008 08:44:47 -0500 Subject: [gutvol-d] Status of Magazine Articles Message-ID: Hi all, Apologies in advance if this has been discussed recently. I've not been getting the PG Volunteer Discussion List mails for some time due to some other priorities taking my time. Also, if this is answered in a FAQ, please just point me to it. I'd like to digitize articles from a magazine from 1929. I'm fairly sure that the magazine did not pay its writers so I'm also fairly sure that the writers would have retained copyright. However, there's no way that I know of to be certain that the writers were not paid for their work. Is there any way that I can confirm the copyright status of these individual articles? Assuming the authors were not paid for their work, would the authors (or their estates) retain copyright? I mean I know that copyright laws have changed since 1929 but I'd think the work for hire aspects would have been the same even then. 
I've managed to get hold of the family of one of the authors and they've tentatively given me permission to reprint the article; I would have asked for their permission even if I were sure the article is in the Public Domain because it just seems rude to me not to ask and I don't believe that asking permission changes the basic copyright status of the article either way. I just want to make sure I'm not opening myself up to a copyright infringement lawsuit from some other relative who may think they can make a fast buck. Any advice would be greatly appreciated. If anyone knows of a directory of lawyers that know IP law, that'd be fine with me; I don't mind paying for legal advice. I just don't know where I would find lawyers that have expertise in this area of law. -- Onorio Catenacci III From grythumn at gmail.com Mon Jan 14 05:51:21 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Mon, 14 Jan 2008 08:51:21 -0500 Subject: [gutvol-d] Status of Magazine Articles In-Reply-To: References: Message-ID: <15cfa2a50801140551p2de7f0c8ie7ce70ba39fb6518@mail.gmail.com> On Jan 14, 2008 8:44 AM, Onorio Catenacci wrote: > I'd like to digitize articles from a magazine from 1929. I'm fairly > sure that the magazine did not pay its writers so I'm also fairly sure > that the writers would have retained copyright. However, there's no > way that I know of to be certain that the writers were not paid for Was it published in the US, and were the authors of US birth? If so, it may be clearable under Rule 6. The research is fairly long, but it is where the majority of the SF is coming from. Also, check to see if they have copyright notices.. some of the amateur magazines can be cleared that way. 
R C From Catenacci at Ieee.Org Mon Jan 14 06:07:34 2008 From: Catenacci at Ieee.Org (Onorio Catenacci) Date: Mon, 14 Jan 2008 09:07:34 -0500 Subject: [gutvol-d] Status of Magazine Articles In-Reply-To: <15cfa2a50801140551p2de7f0c8ie7ce70ba39fb6518@mail.gmail.com> References: <15cfa2a50801140551p2de7f0c8ie7ce70ba39fb6518@mail.gmail.com> Message-ID: On Jan 14, 2008 8:51 AM, Robert Cicconetti wrote: > On Jan 14, 2008 8:44 AM, Onorio Catenacci wrote: > > I'd like to digitize articles from a magazine from 1929. I'm fairly > > sure that the magazine did not pay its writers so I'm also fairly sure > > that the writers would have retained copyright. However, there's no > > way that I know of to be certain that the writers were not paid for > > Was it published in the US, and were the authors of US birth? If so, > it may be clearable under Rule 6. The research is fairly long, but it > is where the majority of the SF is coming from. Also, check to see if > they have copyright notices.. some of the amateur magazines can be > cleared that way. > Yes published in the US and yes authors were US born. If anyone doing the research for the SF stuff could point me to resources to check this, I would appreciate the help. When I looked through the FAQ, Rule 6 seemed most applicable to me but I have to confess it wasn't quite clear to me. I did look at table of contents for the magazine and I believe the publication info was at the bottom of that page. I didn't see a copyright notice but I may have missed it. I do believe that this magazine was mostly written by amateur authors interested in the hobby that the magazine was discussing. 
-- Onorio Catenacci III From greg at durendal.org Mon Jan 14 06:10:39 2008 From: greg at durendal.org (Greg Weeks) Date: Mon, 14 Jan 2008 09:10:39 -0500 (EST) Subject: [gutvol-d] Status of Magazine Articles In-Reply-To: References: Message-ID: On Mon, 14 Jan 2008, Onorio Catenacci wrote: > Is there any way that I can confirm the copyright status of these > individual articles? Assuming the authors were not paid for their > work, would the authors (or their estates) retain copyright? I mean > I know that copyright laws have changed since 1929 but I'd think the > work for hire aspects would have been the same even then. I've > managed to get hold of the family of one of the authors and they've > tentatively given me permission to reprint the article; I would have > asked for their permission even if I were sure the article is in the I've had some luck chasing down the current owners of the magazines and asking them. F&SF/Venture and Analog both responded to my inquiries. I've been ignored a lot too though. -- Greg Weeks http://durendal.org:8080/greg/ 
From greg at durendal.org Mon Jan 14 06:13:06 2008 From: greg at durendal.org (Greg Weeks) Date: Mon, 14 Jan 2008 09:13:06 -0500 (EST) Subject: [gutvol-d] Status of Magazine Articles In-Reply-To: References: <15cfa2a50801140551p2de7f0c8ie7ce70ba39fb6518@mail.gmail.com> Message-ID: On Mon, 14 Jan 2008, Onorio Catenacci wrote: > I did look at table of contents for the magazine and I believe the > publication info was at the bottom of that page. I didn't see a > copyright notice but I may have missed it. I do believe that this > magazine was mostly written by amateur authors interested in the hobby > that the magazine was discussing. It sounds like a good candidate for Rule 5, no copyright notice. I've cleared at least one issue of Astounding that way. Look at the cover and a few pages around the table of contents page to be sure. I've never seen the copyright notice on a magazine anywhere else but the table of contents page. -- Greg Weeks http://durendal.org:8080/greg/ From grythumn at gmail.com Mon Jan 14 07:54:55 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Mon, 14 Jan 2008 10:54:55 -0500 Subject: [gutvol-d] #23961 copyrighted remove from catalog In-Reply-To: <9c6138c50801131844r378d0efeh58c14855e902b93f@mail.gmail.com> References: <9c6138c50801131844r378d0efeh58c14855e902b93f@mail.gmail.com> Message-ID: <15cfa2a50801140754g71d9ebf8v51badf4be13fd011@mail.gmail.com> How do you know this is a rule 1? Do you have the clearance key? R C On Jan 13, 2008 9:44 PM, Ricardo F Diogo wrote: > Etext #23961 (Manifesto Anti-Dantas) is not in the public domain in > the USA under the pre-1923 rule. Unless the editor gave his > permission, it must be removed from the catalog. 1916 is _NOT_ the > publication date. It's part of the text. 
From ricardofdiogo at gmail.com Mon Jan 14 08:15:12 2008 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Mon, 14 Jan 2008 16:15:12 +0000 Subject: [gutvol-d] #23961 copyrighted remove from catalog In-Reply-To: <15cfa2a50801140754g71d9ebf8v51badf4be13fd011@mail.gmail.com> References: <9c6138c50801131844r378d0efeh58c14855e902b93f@mail.gmail.com> <15cfa2a50801140754g71d9ebf8v51badf4be13fd011@mail.gmail.com> Message-ID: <9c6138c50801140815j6f4e6f76qe277aeeba94772eb@mail.gmail.com> 2008/1/14, Robert Cicconetti : > How do you know this is a rule 1? Do you have the clearance key? > > R C > I don't. You can call it intuition. Ricardo From gbnewby at pglaf.org Mon Jan 14 10:51:06 2008 From: gbnewby at pglaf.org (Greg Newby) Date: Mon, 14 Jan 2008 10:51:06 -0800 Subject: [gutvol-d] #23961 copyrighted remove from catalog In-Reply-To: <9c6138c50801131844r378d0efeh58c14855e902b93f@mail.gmail.com> References: <9c6138c50801131844r378d0efeh58c14855e902b93f@mail.gmail.com> Message-ID: <20080114185106.GB12200@mail.pglaf.org> On Mon, Jan 14, 2008 at 02:44:57AM +0000, Ricardo F Diogo wrote: > Etext #23961 (Manifesto Anti-Dantas) is not in the public domain in > the USA under the pre-1923 rule. Unless the editor gave his > permission, it must be removed from the catalog. 1916 is _NOT_ the > publication date. It's part of the text. > > In cases where the publishing date is not prominent in Portuguese > Title Pages, you are much welcome to ask for my help before giving the > clearance line. > > Ricardo Thanks for your note, Ricardo. Please email copyright at pglaf.org (which goes to me & Juliet, who perform the clearances) for copyright info/inquiries. For this item, the library catalog says it was published in 1916. 
Here's a long link: http://opac.porbase.org/ipac20/ipac.jsp?session=T1981772V77E7.45462&profile=porbase&uri=full=3100024@!1405290@!2&ri=1&aspect=power&menu=search&source=192.168.0.17@!porbase&ipp=20&staffonly=&term=manifesto+anti&index=.TW&uindex=&oper=and&term=almada+negreiros&index=.AW&uindex=&aspect=power&menu=search&ri=1 When do you think it was published? -- Greg From Bowerbird at aol.com Mon Jan 14 11:43:02 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 14 Jan 2008 14:43:02 EST Subject: [gutvol-d] a quick update on "the stare of the art of digitization" Message-ID: this is a quick update on "the stare of the art of digitization". first, i've determined that the two p.g. versions of "moby dick" have substantial differences, even though one of them was said to be based -- in part -- on the other, so i've decided to use them _both_ as comparison criterions against the one scanned by o.c.a. i'll let you know how that goes... *** also, i went out to gather material for the next replication of my research, and was pleasantly surprised by what i found... i looked up "books and culture", which was _the_ first book from the public-domain that google made publicly available. google now offers _five_ (count 'em, 5) scan-sets of this book! (those are the "full view" ones; they have "no preview" ones too.) 2 from umichigan, 2 from stanford, and 1 from the n.y. public... in addition, the o.c.a. offers 1 from the university of california and 1 from the university of toronto. (and i expect more from the u.c.) so we have plenty of versions with which to do comparisons on this, plus it indicates we might enjoy a similar plenitude on other books. with multiple sets of o.c.r. for one book, the results will be awesome! i knew this development would come to pass sooner or later, but it's nice to know that it's already happened; it's tremendously good news! now... if only google and the o.c.a. 
would get their _shit_together_ and stop dropping characters (like em-dashes and quote-marks), we could get to the job of cleaning the o.c.r. from millions of books. -bowerbird From Bowerbird at aol.com Mon Jan 14 11:45:23 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 14 Jan 2008 14:45:23 EST Subject: [gutvol-d] "stare of the art" -- ha ha! Message-ID: "stare of the art" -- ha ha! it's a good thing i'm not a proofreader! ;+) -bowerbird From Bowerbird at aol.com Mon Jan 14 13:24:31 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 14 Jan 2008 16:24:31 EST Subject: [gutvol-d] a quick update on "the state of the art of digitization" Message-ID: i said: > google now offers _five_ (count 'em, 5) scan-sets of this book! > ... > in addition, the o.c.a. offers 1 from the university of california and > 1 from the university of toronto. (and i expect more from the u.c.) i couldn't resist a quick comparison of the 2 sets of o.c.r. from the o.c.a. remember, one of these books was scanned at the university of toronto, and the other was scanned at the university of california. i did only a rudimentary clean-up on each, but it's already the case that there are >6,000 lines in common between them, and <200 different... i say, this comparison method is looking better and better all the time... 
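A rough sketch of the kind of line-level tally reported above (lines in common versus lines that differ between two o.c.r. runs), using Python's standard `difflib`. The function name and the sample strings are invented for illustration, and this is not the tool used for the comparison in the message.

```python
from difflib import SequenceMatcher


def count_line_agreement(text_a, text_b):
    """Count lines two OCR texts share and lines where they differ."""
    lines_a = [line.strip() for line in text_a.splitlines()]
    lines_b = [line.strip() for line in text_b.splitlines()]
    matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    common = 0
    different = 0
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            common += a_end - a_start
        else:
            # A replaced, inserted, or deleted run: count the longer side.
            different += max(a_end - a_start, b_end - b_start)
    return common, different


# Made-up sample text; the second scan has one OCR slip ("wincl").
scan_a = "the ship left the harbor\nat dawn the wind rose\nall hands were on deck"
scan_b = "the ship left the harbor\nat dawn the wincl rose\nall hands were on deck"
print(count_line_agreement(scan_a, scan_b))  # (2, 1)
```

Only the lines counted as different would then need a human look, which is the efficiency argument being made; how differing runs are counted (here, the longer side of each mismatched run) is a design choice of this sketch.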
:+) -bowerbird From hart at pglaf.org Mon Jan 14 13:26:58 2008 From: hart at pglaf.org (Michael Hart) Date: Mon, 14 Jan 2008 13:26:58 -0800 (PST) Subject: [gutvol-d] !@! GUARDIAN says PG "World's Best Public Library Message-ID: http://www.guardian.co.uk/technology/2008/jan/14/project.gutenberg?gusrc=rss&feed=technology From Bowerbird at aol.com Tue Jan 15 13:11:11 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Jan 2008 16:11:11 EST Subject: [gutvol-d] happy birthday Message-ID: please. > I Have a Dream > > by Martin Luther King, Jr. > > excerpts from his speech of 28 August 1963, > at the Lincoln Memorial, in Washington D.C. > > > Let us not wallow in the valley of despair, > I say to you today, my friends. > > ...so even though we face the difficulties of today and tomorrow, > I still have a dream. > It is a dream deeply rooted in the American dream. > > I have a dream that one day this nation will rise up > and live out the true meaning of its creed: > "We hold these truths to be self-evident, > that all people are created equal." > > I have a dream that one day > on the red hills of Georgia, > the sons of former slaves and > the sons of former slave owners will be able to > sit down together at the table of brotherhood. > > I have a dream that one day even the state of Mississippi, > a state sweltering with the heat of injustice, > sweltering with the heat of oppression, > will be transformed into an oasis of freedom and justice. > > I have a dream that my four little children > will one day live in a world where > they will not be judged by the color of their skin but > by the content of their character. 
> > I have a dream today!, > that little black boys > and black girls > will be able to > join hands with > little white boys > and white girls > as sisters and brothers. > > I have a dream today! happy birthday, dr. king... thank you for your dream! -bowerbird From julio.reis at tintazul.com.pt Wed Jan 16 13:48:53 2008 From: julio.reis at tintazul.com.pt (Júlio Reis) Date: Wed, 16 Jan 2008 21:48:53 +0000 Subject: [gutvol-d] happy birthday In-Reply-To: References: Message-ID: <1200520134.7247.152.camel@abetarda> ... and no luck in getting the corresponding e-text back on-line? > > > I Have a Dream > > > > > > by Martin Luther King, Jr. From gbnewby at pglaf.org Wed Jan 16 14:40:23 2008 From: gbnewby at pglaf.org (Greg Newby) Date: Wed, 16 Jan 2008 14:40:23 -0800 Subject: [gutvol-d] happy birthday In-Reply-To: <1200520134.7247.152.camel@abetarda> References: <1200520134.7247.152.camel@abetarda> Message-ID: <20080116224023.GA10033@mail.pglaf.org> On Wed, Jan 16, 2008 at 09:48:53PM +0000, Júlio Reis wrote: > ... and no luck in getting the corresponding e-text back on-line? > > > > > I Have a Dream > > > > > > > > by Martin Luther King, Jr. There was a legal case (not involving PG) in which it was determined that the speech is still under copyright protection. So, PG removed it from our archive (years ago). Under current copyright laws, it will be a while before the speech enters the public domain. I don't think we ever asked, but the King estate does not seem keen on granting redistribution rights. 
-- Greg From Bowerbird at aol.com Wed Jan 16 15:32:49 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Jan 2008 18:32:49 EST Subject: [gutvol-d] happy birthday Message-ID: please. i'll send a copy of the whole thing to anyone who asks. or, you know, you can find it yourself, on the internet, just like i did. (i hit the first site google regurgitated.) imagine someone trying to tell me i can't copy a speech that was given to _me_, about a _dream_ given to _me_. ha! try an' stop me. free at last, free at last, thank god almighty, we're free at last... -bowerbird p.s. i kiss my girl any time i want, for as long as she likes. From schultzk at uni-trier.de Thu Jan 17 00:22:12 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Thu, 17 Jan 2008 09:22:12 +0100 Subject: [gutvol-d] happy birthday In-Reply-To: <20080116224023.GA10033@mail.pglaf.org> References: <1200520134.7247.152.camel@abetarda> <20080116224023.GA10033@mail.pglaf.org> Message-ID: <89624D09-10C0-45F9-82CE-B0838B6B5FFD@uni-trier.de> Hi There, I am not sure what that case was about, but a speech made in public is public domain. Furthermore, King was then a public figure and therefore his speech is even more public. The speech itself is public domain. What is not public domain per se are publications of it. So what was the source of the PG version? regards Keith. On 16.01.2008 at 23:40, Greg Newby wrote: > On Wed, Jan 16, 2008 at 09:48:53PM +0000, Júlio Reis wrote: >> ... and no luck in getting the corresponding e-text back on-line? >> >>>>> I Have a Dream >>>>> >>>>> by Martin Luther King, Jr. 
> > There was a legal case (not involving PG) in which it was > determined that the speech is still under copyright protection. > So, PG removed it from our archive (years ago). > > Under current copyright laws, it will be a while before the speech > enters the public domain. I don't think we ever asked, but the > King estate does not seem keen on granting redistribution rights. > -- Greg From gbnewby at pglaf.org Thu Jan 17 10:38:11 2008 From: gbnewby at pglaf.org (Greg Newby) Date: Thu, 17 Jan 2008 10:38:11 -0800 Subject: [gutvol-d] happy birthday In-Reply-To: <89624D09-10C0-45F9-82CE-B0838B6B5FFD@uni-trier.de> References: <1200520134.7247.152.camel@abetarda> <20080116224023.GA10033@mail.pglaf.org> <89624D09-10C0-45F9-82CE-B0838B6B5FFD@uni-trier.de> Message-ID: <20080117183810.GC27689@mail.pglaf.org> On Thu, Jan 17, 2008 at 09:22:12AM +0100, Schultz Keith J. wrote: > Hi There, > > I am not sure what that case was about, but > a speech made in public is public domain. > Furthermore, King was then a public figure and > therefore his speech is even more public. > > The speech itself is public domain. What is not > public domain per se are publications of it. Feel free (encouraged) to research the case and fight it out with the King estate on our behalf. Maybe it's public domain in other countries? > So what was the source of the PG version? Dunno. Michael Hart probably knows. -- Greg > regards > Keith. > > On 16.01.2008 at 23:40, Greg Newby wrote: > > >On Wed, Jan 16, 2008 at 09:48:53PM +0000, Júlio Reis wrote: > >>... and no luck in getting the corresponding e-text back on-line? > >> > >>>>> I Have a Dream > >>>>> > >>>>> by Martin Luther King, Jr. > > > >There was a legal case (not involving PG) in which it was > >determined that the speech is still under copyright protection. 
> >So, PG removed it from our archive (years ago). > > > >Under current copyright laws, it will be a while before the speech > >enters the public domain. I don't think we ever asked, but the > >King estate does not seem keen on granting redistribution rights. > > -- Greg From Bowerbird at aol.com Thu Jan 17 11:28:09 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 17 Jan 2008 14:28:09 EST Subject: [gutvol-d] happy birthday Message-ID: greg said: > Dunno. Michael Hart probably knows. the story i remember michael telling is that coretta scott king wanted to shake p.g. down. (i'm sure he used a more delicate phrasing...) -bowerbird From hart at pglaf.org Thu Jan 17 14:43:59 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 17 Jan 2008 14:43:59 -0800 (PST) Subject: [gutvol-d] happy birthday In-Reply-To: References: Message-ID: Just the opposite. . .The King Estate NEVER gave PG any grief, we just took the Dream speech down when we learned of the final judicial events, after several reversals in favor of CBS. mh On Thu, 17 Jan 2008, Bowerbird at aol.com wrote: > greg said: >> Dunno. Michael Hart probably knows. > > the story i remember michael telling is that > coretta scott king wanted to shake p.g. down. > (i'm sure he used a more delicate phrasing...) > > -bowerbird 
From Bowerbird at aol.com Thu Jan 17 15:05:57 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 17 Jan 2008 18:05:57 EST Subject: [gutvol-d] happy birthday Message-ID: michael said: > Just the opposite. . .The King Estate NEVER gave PG any grief, > we just took the Dream speech down when we learned of the > final judicial events, after several reversals in favor of CBS. oh, ok. my apologies for misremembering what transpired... but michael, since that is the case, you might want to see what the lawyer for the king estate said, when the suit was brought: > "Let's talk about who's being greedy," Beck said. > "We give the speech to schools for free. > We give the speech to non-profits and churches for free. > CBS -- they don't deny it -- charges $1,000 a minute for > a public school to have access to 'I Have a Dream.'" > http://www.cnn.com/US/9905/11/king.speech.02/index.html given p.g.'s status as a non-profit library, i'd guess you're safe... -bowerbird From ricardofdiogo at gmail.com Fri Jan 18 18:02:35 2008 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Sat, 19 Jan 2008 02:02:35 +0000 Subject: [gutvol-d] Portuguese blog on ebooks Message-ID: <9c6138c50801181802q65e6d338ye12afce2313b8fc5@mail.gmail.com> Everyone has a blog. 
_I have the right to have one too!!_ Just in case there's some Portuguese-speaking volunteer around, here's my new blog on ebooks: http://ler-digital.blogspot.com/ Ricardo
From julio.reis at tintazul.com.pt Sun Jan 20 07:09:04 2008 From: julio.reis at tintazul.com.pt (Júlio Reis) Date: Sun, 20 Jan 2008 15:09:04 +0000 Subject: [gutvol-d] I Have A Dream In-Reply-To: References: Message-ID: <1200841744.9965.31.camel@abetarda> So how about someone asking the King Estate? I'd be happy to translate it into Portuguese. Júlio. > given p.g.'s status as a non-profit library, i'd guess you're safe...
From prosfilaes at gmail.com Sun Jan 20 07:54:05 2008 From: prosfilaes at gmail.com (David Starner) Date: Sun, 20 Jan 2008 10:54:05 -0500 Subject: [gutvol-d] I Have A Dream In-Reply-To: <1200841744.9965.31.camel@abetarda> References: <1200841744.9965.31.camel@abetarda> Message-ID: <6d99d1fd0801200754t704eb98bs19db19cb37135865@mail.gmail.com> On Jan 20, 2008 10:09 AM, Júlio Reis wrote: > So how about someone asking the King Estate? I'd be happy to translate > it into Portuguese. (a) The reason we know it's copyrighted is because the King Estate spent lots of money litigating it. There are enough examples of the King Estate being restrictive on reuse to negate the interest most of us might have in asking. (b) I find it highly unlikely that, even if they gave us permission to host the speech, they would let us make derivative works, including translations.
From f.fuchs at gmx.net Sun Jan 20 12:06:30 2008 From: f.fuchs at gmx.net (Franz Fuchs) Date: Sun, 20 Jan 2008 21:06:30 +0100 Subject: [gutvol-d] YouTube: A Librarian Reviews the XO Laptop References: Message-ID: <001801c85b9f$fd0eea80$8c00000a@frf> http://youtube.com/watch?v=quJIAucDOU0 --- I wish all the best for the One Laptop Per Child Foundation. Nicholas Negroponte is a true visionary and I applaud his efforts to help provide an education to children in developing nations.
I wonder if he and others working with the OLPC realize how much they are educating adults in this nation as we partner with them to help bridge the gap in the digital divide http://librarianbydesign.blogspot.com/ Added: January 13, 2008 --- Best regards FrF
From Bowerbird at aol.com Fri Jan 25 11:09:32 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 25 Jan 2008 14:09:32 EST Subject: [gutvol-d] author pirates his own books Message-ID: author pirates his own books, and increases sales dramatically: > http://torrentfreak.com/alchemist-author-pirates-own-books-080124/ -bowerbird
From ricardofdiogo at gmail.com Fri Jan 25 12:53:29 2008 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Fri, 25 Jan 2008 20:53:29 +0000 Subject: [gutvol-d] author pirates his own books In-Reply-To: References: Message-ID: <9c6138c50801251253k1166141fo96c4e7b2a127b35@mail.gmail.com> Yes. He also told me that PG is allowed to distribute his books. I'm waiting for an answer from Greg so that Coelho can send the permission letter to PG. Ricardo
From hart at pglaf.org Fri Jan 25 13:15:32 2008 From: hart at pglaf.org (Michael Hart) Date: Fri, 25 Jan 2008 13:15:32 -0800 (PST) Subject: [gutvol-d] author pirates his own books In-Reply-To: <9c6138c50801251253k1166141fo96c4e7b2a127b35@mail.gmail.com> References: <9c6138c50801251253k1166141fo96c4e7b2a127b35@mail.gmail.com> Message-ID: You can have an answer from me. Go for it! And let's be sure to mention all this INSIDE the books. Perhaps get permission from the author of one or two articles that mention the whole process. . . .? Michael
From ricardofdiogo at gmail.com Fri Jan 25 13:24:47 2008 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Fri, 25 Jan 2008 21:24:47 +0000 Subject: [gutvol-d] author pirates his own books In-Reply-To: References: <9c6138c50801251253k1166141fo96c4e7b2a127b35@mail.gmail.com> Message-ID: <9c6138c50801251324o4edff427j7c7da588cc1e905a@mail.gmail.com> Can I ask him to send you the following letter? Michael S. Hart Founder, Project Gutenberg 405 West Elm Street Urbana IL, 61801-3231, USA Dear Project Gutenberg: It gives me pleasure to grant Project Gutenberg perpetual, worldwide, non-exclusive rights to distribute all my books in electronic form through Project Gutenberg Web sites, CDs or other current and future formats. No royalties are due for these rights. The same applies to end users. Sincerely, Will it be enough? Ricardo
From Bowerbird at aol.com Mon Jan 28 13:08:32 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 28 Jan 2008 16:08:32 EST Subject: [gutvol-d] talk about a thin computer! Message-ID: talk about a thin computer! > http://www.youtube.com/watch?v=i6yBo9NPkCQ -bowerbird
From Bowerbird at aol.com Tue Jan 29 15:29:40 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 29 Jan 2008 18:29:40 EST Subject: [gutvol-d] more on "doing things the hard way" Message-ID: remember the project over at d.p. where they're attempting to determine a "confidence in page accuracy" computation, to tell if a page is accurate enough, or needs more proofing?
well, i'm sure they're hard at work, cranking their numbers, but meanwhile, here comes in a good observation analysis: > The P1/P1/P2 - P3 skips are filtering through to F2 now, > and look to be in pretty good shape. Thumbs Up > I think it is the best initiative we have had for a while. > http://www.pgdp.net/phpBB2/viewtopic.php?p=417654#417654 that's right. sending text through 2 rounds of p1, then a p2, results in clean text, probably not all that much different from a p1-p2-p3 routing. 3 rounds will usually give you clean text, _even_without_any_"proven-talent"_p3_proofer_in_the_mix_... whether you can stop after 2, or even 1, is what the question is. but p1 proofers are _plentiful_, so why not just do p1-p1-p1? and even just p1-p1 _if_ the second proofer makes no change? oh yeah, then you'd just be following my rule about consensus... no complex statistics necessary, just a simple test of equivalence. you know people _want_ things to be difficult, _like_ them difficult, when they won't even _try_ the simple way first... :+) -bowerbird p.s. oh yeah, and the best way to obtain great accuracy is to program a checker that flags _all_ errors, and _only_ errors... a hint: it's not as impossible as you might think at first glance.
From creeva at gmail.com Tue Jan 29 16:08:53 2008 From: creeva at gmail.com (Brent Gueth) Date: Tue, 29 Jan 2008 19:08:53 -0500 Subject: [gutvol-d] more on "doing things the hard way" In-Reply-To: References: Message-ID: <2510ddab0801291608v2547fe1atd3e5313454354a81@mail.gmail.com> People reinvent the wheel to give them a place in hierarchy Or Bureaucracy makes the world go round.
Or People think there is always a better way Or finally, Show a geek how you do something and he'll always find a way to show why his method is better. I'm not taking sides on either angle of this argument - but it's been going on for awhile - what does it take to get a consensus and move on from there. Someone just needs to say this is the way it's going to be done - then we can all move on, quietly gripe and take potshots while work is getting accomplished versus debate.
From Morasch at aol.com Wed Jan 30 13:29:24 2008 From: Morasch at aol.com (Morasch at aol.com) Date: Wed, 30 Jan 2008 16:29:24 EST Subject: [gutvol-d] more on "doing things the hard way" Message-ID: brent said: > but it's been going on for awhile i'll say! :+) and the stray comment i make here every once in a while is just a very small tip of the big iceberg of discussion over there. i can point you to literally _dozens_ of different threads, where many different proposals have been made, and some executed -- threads going _dozens_ of pages, at 15 messages/page -- so these projects are discussed ad infinitum and then forgotten, at least until a similar thread raises its head years down the line. they've talked this issue to death, and basically gotten nowhere, and it's extremely frustrating to them, to a great many of them... that's why it's so comical when you can see the answer is so easy. > what does it take to get a consensus and move on from there. none of them seem to know that, either, and say so, frequently... basically, it means juliet giving the go-ahead, but she's confused, hopelessly confused, and that means everyone ends up confused. but yeah, she's who stated that this "confidence in page" thingee needed to be calculable before d.p. can go to a roundless mode. it was at that time that i made my "that's not really the case" post. i would have said it there -- said it _again_, that is, since i said it there many times before -- except i was _banned_ from speaking. so i said it here instead. (thanks, michael, for freedom to speak.)
not many of the people that are off on this useless quest are here, though, except for carlo (who's the main leader of the uselessness), so it doesn't matter all that much. just me feeling a need to say it... > then we can all move on, quietly gripe and take potshots > while work is getting accomplished versus debate. it's not quite so easy to say that "work is getting accomplished"... of course _some_ work is being "accomplished", but the question is "at what expense in human time and energy?" if the process wastes a huge amount of resources, and could be massively more efficient (getting more "accomplished" and creating more happiness as well), shouldn't someone who can _recognize_that_ step up and speak out? i certainly believe so, and believe so strongly. so when that person is _me_, i'm gonna step up and speak out. and that's how it's gonna be. but i'm sure glad no one has been making a federal case out of it lately. i just wanna put myself on the record, so when d.p. eventually wises up, an objective observer sees that they should've listened to me originally. -bowerbird p.s. i'm also trying to inspire some thinking at a much higher level. perhaps you would like it more if i just pitched posts at that altitude? for example, since people are extending the effort to try to determine how to predict if a page is accurate-enough or not, what if it appears that -- with just a bit more effort -- they could obtain a useful answer? then, even though it was a big mistake to _start_out_ on that pathway, should they nevertheless continue? now _that's_ an interesting query! i would say _yes_, they should, even though i believe they won't succeed. but i could have instead posted a message that considered this question. would you have preferred that? or, to take it even further, let's ask ourselves what kind of system they'll employ in order to _test_the_efficacy_ of their predictor, if they do use it. 
i would argue that they will need to utilize some kind of _infrastructure_ that collects error-reports downstream and feeds-back to their predictor. otherwise, their predictor could be flawed, and they would never find out. but they haven't thought that far ahead, and realized they need to build it. moreover, if they _do_ create a downstream error-reporting system, then _that_ could be considered their "last line of defense", and thus there is a good reason to propose that they don't even _need_ a predictor machine. and, in this regard, it's interesting to note that they have not made use of their closest proxy to that variable now, namely the errors that are being reported by their "smooth-readers", who read a final text _for_content_. these smooth-readers do find errors -- even after 3 rounds of proofing and 2 rounds of formatting, yes! -- and it would be extremely cogent to ascertain the underlying nature of such errors, if there happens to be one. so, would you prefer having a discussion that was pitched at _that_ level? i'm a self-starter. i'm happy -- quite happy -- to post without any replies. but, you know, if anyone wants to have a _conversation_, i can do that too. just let me know at what level of the mountain you want to pitch the tent...
From klofstrom at gmail.com Wed Jan 30 18:22:19 2008 From: klofstrom at gmail.com (Karen Lofstrom) Date: Wed, 30 Jan 2008 16:22:19 -1000 Subject: [gutvol-d] more on "doing things the hard way" In-Reply-To: References: Message-ID: <1e8e65080801301822t4d66de0crea0285523af0c5ad@mail.gmail.com> On Jan 30, 2008 11:29 AM, wrote: > i'm a self-starter. i'm happy -- quite happy -- to post without any replies.
So much so that you change your username to avoid killfiles. That's not someone who doesn't mind whether anyone is listening or not; that's someone intent to annoy. Your comments on the DP process are worthless because you've never done any higher-round proofing, formatting, or PPing. You've done a few pages a few years ago ... and you're an expert? It is to laugh. Any sensible list moderator would have banned you long ago. Now into the killfile with you. -- Karen Lofstrom
From schultzk at uni-trier.de Thu Jan 31 01:40:22 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Thu, 31 Jan 2008 10:40:22 +0100 Subject: [gutvol-d] more on "doing things the hard way" In-Reply-To: References: Message-ID: <6A1B3201-C914-41A4-A1C6-767CEBAC2C16@uni-trier.de> Hi there, This post somewhat confuses me. 1) In the middle it is signed Bowerbird 2) a copy is sent to bowerbird 3) the style is not bowerbird 4) the content does seem to indicate bowerbird 5) it is not bowerbird's way to hide himself So is the post below actually from bowerbird? Anyway. As with any open project, anarchy tends to rule. That is, there are long discussions that do not seem to lead to a meaningful end, namely the implementation of what was discussed. More often than not, only a very small part actually gets implemented, since a consensus cannot be reached. Such projects as DP and PG need some kind of authoritative manager. Without such a manager the evolution of the project takes time. If DP has a problem with you then stay away. I do, as I do not like their way of doing things. But that is just my opinion. regards Keith On 30.01.2008 at 22:29, Morasch at aol.com wrote: > brent said: > > but it's been going on for awhile > > i'll say! :+) > > and the stray comment i make here every once in a while is > just a very small tip of the big iceberg of discussion over there.
> > i can point you to literally _dozens_ of different threads, where > many different proposals have been made, and some executed > -- threads going _dozens_ of pages, at 15 messages/page -- > so these projects are discussed ad infinitum and then forgotten, > at least until a similar thread raises its head years down the line. > > they've talked this issue to death, and basically gotten nowhere, > and it's extremely frustrating to them, to a great many of them... > > that's why it's so comical when you can see the answer is so easy. > > > > what does it take to get a consensus and move on from there. > > none of them seem to know that, either, and say so, frequently... > > basically, it means juliet giving the go-ahead, but she's confused, > hopelessly confused, and that means everyone ends up confused. > > but yeah, she's who stated that this "confidence in page" thingee > needed to be calculable before d.p. can go to a roundless mode. > > it was at that time that i made my "that's not really the case" post. > > i would have said it there -- said it _again_, that is, since i > said it > there many times before -- except i was _banned_ from speaking. > > so i said it here instead. (thanks, michael, for freedom to speak.) > > not many of the people that are off on this useless quest are here, > though, except for carlo (who's the main leader of the uselessness), > so it doesn't matter all that much. just me feeling a need to say > it... > > > > then we can all move on, quietly gripe and take potshots > > while work is getting accomplished versus debate. > > it's not quite so easy to say that "work is getting accomplished"... > > of course _some_ work is being "accomplished", but the question is > "at what expense in human time and energy?" if the process wastes > a huge amount of resources, and could be massively more efficient > (getting more "accomplished" and creating more happiness as well), > shouldn't someone who can _recognize_that_ step up and speak out? 
> > i certainly believe so, and believe so strongly. so when that > person is > _me_, i'm gonna step up and speak out. and that's how it's gonna be. > > but i'm sure glad no one has been making a federal case out of it > lately. > > i just wanna put myself on the record, so when d.p. eventually > wises up, > an objective observer sees that they should've listened to me > originally. > > -bowerbird > > p.s. i'm also trying to inspire some thinking at a much higher level. > perhaps you would like it more if i just pitched posts at that > altitude? > > for example, since people are extending the effort to try to determine > how to predict if a page is accurate-enough or not, what if it appears > that -- with just a bit more effort -- they could obtain a useful > answer? > then, even though it was a big mistake to _start_out_ on that pathway, > should they nevertheless continue? now _that's_ an interesting query! > i would say _yes_, they should, even though i believe they won't > succeed. > but i could have instead posted a message that considered this > question. > would you have preferred that? > > or, to take it even further, let's ask ourselves what kind of > system they'll > employ in order to _test_the_efficacy_ of their predictor, if they > do use it. > i would argue that they will need to utilize some kind of > _infrastructure_ > that collects error-reports downstream and feeds-back to their > predictor. > otherwise, their predictor could be flawed, and they would never > find out. > but they haven't thought that far ahead, and realized they need to > build it. > moreover, if they _do_ create a downstream error-reporting system, > then > _that_ could be considered their "last line of defense", and thus > there is a > good reason to propose that they don't even _need_ a predictor > machine. 
> and, in this regard, it's interesting to note that they have not > made use of > their closest proxy to that variable now, namely the errors that > are being > reported by their "smooth-readers", who read a final text > _for_content_. > these smooth-readers do find errors -- even after 3 rounds of proofing > and 2 rounds of formatting, yes! -- and it would be extremely > cogent to > ascertain the underlying nature of such errors, if there happens to > be one. > so, would you prefer having a discussion that was pitched at _that_ > level? > > i'm a self-starter. i'm happy -- quite happy -- to post without > any replies. > > but, you know, if anyone wants to have a _conversation_, i can do > that too. > just let me know at what level of the mountain you want to pitch > the tent...
From schultzk at uni-trier.de Thu Jan 31 01:56:07 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Thu, 31 Jan 2008 10:56:07 +0100 Subject: [gutvol-d] more on "doing things the hard way" In-Reply-To: <1e8e65080801301822t4d66de0crea0285523af0c5ad@mail.gmail.com> References: <1e8e65080801301822t4d66de0crea0285523af0c5ad@mail.gmail.com> Message-ID: Hi, What is in a username? I have about 8. I do tend to use just 2 E-mail addresses. The person did sign in the middle. Though, as I noted in another reply, I am not quite convinced it is actually Bowerbird. I do not know if you have ever done system analysis or not, but it can be done, and is done, without practical experience. It is highly theoretical.
I wonder if Einstein had practical experience with relativity. O.K. I know he did not. Yet, somehow he was right. At least that is what physics tells us today. No. No. No. Bowerbird does not come close to Einstein. He does have his caveats. I also tend to disagree with him and enjoy the discussions, because he is willing to debate. regards Keith.
From Bowerbird at aol.com Thu Jan 31 10:32:29 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 31 Jan 2008 13:32:29 EST Subject: [gutvol-d] more on "doing things the hard way" Message-ID: keith- please accept my apologies. i didn't mean to confuse anyone. i just happened to e-mail that post from the wrong log-in... so here it is again, so you know it's "genuine"... thank you... -bowerbird ======================================= brent said: > but it's been going on for awhile i'll say! :+) and the stray comment i make here every once in a while is just a very small tip of the big iceberg of discussion over there.
i can point you to literally _dozens_ of different threads, where many different proposals have been made, and some executed -- threads going _dozens_ of pages, at 15 messages/page -- so these projects are discussed ad infinitum and then forgotten, at least until a similar thread raises its head years down the line. they've talked this issue to death, and basically gotten nowhere, and it's extremely frustrating to them, to a great many of them... that's why it's so comical when you can see the answer is so easy. > what does it take to get a consensus and move on from there. none of them seem to know that, either, and say so, frequently... basically, it means juliet giving the go-ahead, but she's confused, hopelessly confused, and that means everyone ends up confused. but yeah, she's who stated that this "confidence in page" thingee needed to be calculable before d.p. can go to a roundless mode. it was at that time that i made my "that's not really the case" post. i would have said it there -- said it _again_, that is, since i said it there many times before -- except i was _banned_ from speaking. so i said it here instead. (thanks, michael, for freedom to speak.) not many of the people that are off on this useless quest are here, though, except for carlo (who's the main leader of the uselessness), so it doesn't matter all that much. just me feeling a need to say it... > then we can all move on, quietly gripe and take potshots > while work is getting accomplished versus debate. it's not quite so easy to say that "work is getting accomplished"... of course _some_ work is being "accomplished", but the question is "at what expense in human time and energy?" if the process wastes a huge amount of resources, and could be massively more efficient (getting more "accomplished" and creating more happiness as well), shouldn't someone who can _recognize_that_ step up and speak out? i certainly believe so, and believe so strongly. 
so when that person is _me_, i'm gonna step up and speak out. and that's how it's gonna be. but i'm sure glad no one has been making a federal case out of it lately. i just wanna put myself on the record, so when d.p. eventually wises up, an objective observer sees that they should've listened to me originally. -bowerbird p.s. i'm also trying to inspire some thinking at a much higher level. perhaps you would like it more if i just pitched posts at that altitude? for example, since people are extending the effort to try to determine how to predict if a page is accurate-enough or not, what if it appears that -- with just a bit more effort -- they could obtain a useful answer? then, even though it was a big mistake to _start_out_ on that pathway, should they nevertheless continue? now _that's_ an interesting query! i would say _yes_, they should, even though i believe they won't succeed. but i could have instead posted a message that considered this question. would you have preferred that? or, to take it even further, let's ask ourselves what kind of system they'll employ in order to _test_the_efficacy_ of their predictor, if they do use it. i would argue that they will need to utilize some kind of _infrastructure_ that collects error-reports downstream and feeds-back to their predictor. otherwise, their predictor could be flawed, and they would never find out. but they haven't thought that far ahead, and realized they need to build it. moreover, if they _do_ create a downstream error-reporting system, then _that_ could be considered their "last line of defense", and thus there is a good reason to propose that they don't even _need_ a predictor machine. and, in this regard, it's interesting to note that they have not made use of their closest proxy to that variable now, namely the errors that are being reported by their "smooth-readers", who read a final text _for_content_. 
these smooth-readers do find errors -- even after 3 rounds of proofing and 2 rounds of formatting, yes! -- and it would be extremely cogent to ascertain the underlying nature of such errors, if there happens to be one.

so, would you prefer having a discussion that was pitched at _that_ level?

i'm a self-starter. i'm happy -- quite happy -- to post without any replies. but, you know, if anyone wants to have a _conversation_, i can do that too. just let me know at what level of the mountain you want to pitch the tent...

From Bowerbird at aol.com Thu Jan 31 11:12:17 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 31 Jan 2008 14:12:17 EST
Subject: [gutvol-d] more on "doing things the hard way"
Message-ID:

keith said:
> No. No. No. Bowerbird does not come close to Einstein.

well, my hair _is_ quite beautiful, but hey, _nothing_ could top his. that guy had the greatest hair ever.

as for zora (karen lofstrom), well, she acts like a dingbat... (notice i did _not_ say she _is_ a dingbat, because that would be an ad hominem argument; i am only speaking about her _behavior_, which is within her capacity for _change_...)

the argument that you need to be _inside_ a tar-pit to know it's a tar-pit is laughably silly... a lot of things are much easier to see from an _objective_ perspective, even a distant one.

and i have a ton of experience digitizing text outside of d.p. i've done dc-10 flight manuals, text-books, magazines, poetry, novels, and a host of other stuff, including public-domain books...
if she had paid attention, zora would even know that i've analyzed a book she processed; i documented _dozens_ of errors -- embarrassing ones -- inside it, errors that remain to this very day.

i clean up and format digitized text for entertainment, like other people will do crossword puzzles or sudoku. heck, i did lessig's "future of ideas" last week, just for the fun of it, which was nice because i had _clean_text_, since i just copied it out of the .pdf, but was also a bear because i had to rework the formatting extensively, since i copied the text out of a .pdf...

> http://z-m-l.com/go/llfoi/llfoif001.html

the thing is, when you do something _for_fun_, you simply won't let yourself get trapped in a tar-pit...

that i make suggestions as to how d.p. could get itself out of its current tar-pit is _an_act_of_love_, because i highly value the individuals who are volunteering time and energy in support of the public-domain. the failure of the d.p. "powers that be" -- not to mention many of those volunteers -- is on _them_, and not my responsibility.

it's neither here nor there, though, because within a few years' time, anyone will be able to digitize any book they want, by simply dropping the o.c.r. results onto a clean-up program and answering a few questions to resolve ambiguity, so d.p. will either morph to that reality, or die...

-bowerbird

From Bowerbird at aol.com Thu Jan 31 11:23:32 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 31 Jan 2008 14:23:32 EST
Subject: [gutvol-d] two overarching thoughts on a roundless system of proofing
Message-ID:

please let me repeat this post, from saturday, december 15, 2007 at 1:29pm:

i'll have a lot more later -- it's written already, but i think i will wait until monday to send it to this list -- but here are two overarching thoughts about implementing a _roundless_ system of proofing...

(in case you're wondering why, this is a topic that is being discussed over on the d.p. forums, presently, and often over the past few years. and it's a shame it never moves past the discussion phase, since the current system -- where _every_page_of_every_book_ is slated to go through a specific number of rounds -- is grossly inefficient, and has led to a huge waste of time and energy, plus endless discussions and a wide array of experiments to overcome its obvious shortcomings. however, the discussion is marred by a bunch of people who simply don't know what they're talking about, and by the fact that no one over there seems to be able to separate the wheat from the chaff...)

anyway, here are those two overarching thoughts.

1. it's unnecessary to "formulate some kind of metric" to inform you when a specific page can be considered "finished". it is _done_ when a certain number of people -- say 2 to 4 -- can't find any errors in it. at that point, even if there _are_ still errors in it, it has simply become unproductive to schedule yet _another_ set of eyes to look for them... but, for the vast majority of pages, there just won't be any errors left.

you don't have to believe me. just try it -- as the simplest thing that _might_ work -- and you will happily discover it does indeed work...

2. it's unnecessary to "formulate some kind of metric" to inform you about the proofing skills of each volunteer. it's easy enough to use the obvious measures to determine a score, but it's unnecessary to _use_ that score in order to assign pages to the proofer, since the measure of whether a page is "finished" or not is impervious to the skill levels of the proofers. if 2-4 "average" proofers find no errors left on a page, then the odds are that a "great" proofer won't either.

and -- once again -- you don't have to believe me that this is true; try it -- as the simplest thing that _might_ work -- and find it does...

in other words, don't make it more complicated than it has to be...

-bowerbird

p.s. thank you...
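
[editor's illustration] the stopping rule in thought 1 above can be sketched in a few lines of python. this is only a rough sketch under stated assumptions, not anything d.p. runs: the names `Page`, `record_pass`, and `required_clean` are hypothetical, and the choices to count only _distinct_ proofers and to reset the count whenever someone changes the text are assumptions the post does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One page of proofed text (hypothetical structure, for illustration)."""
    text: str
    # Proofers who found nothing to fix since the text last changed.
    clean_by: set = field(default_factory=set)


def record_pass(page, proofer, corrected_text, required_clean=3):
    """Record one proofing pass; return True once the page counts as done.

    Assumption: any correction resets the tally of clean passes, and a
    proofer who looks at the same unchanged text twice counts only once.
    """
    if corrected_text != page.text:
        # The proofer found at least one error: keep the fix, start over.
        page.text = corrected_text
        page.clean_by.clear()
    else:
        # The proofer found no errors on the current text.
        page.clean_by.add(proofer)
    return len(page.clean_by) >= required_clean
```

note that nothing here scores the proofers or the page; the only input is whether each pass changed the text, and `required_clean` stands in for the "say 2 to 4" figure from the post.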