From Bowerbird at aol.com Mon Mar 1 10:35:40 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 1 Mar 2010 13:35:40 EST Subject: [gutvol-d] or a watch to take out of it Message-ID: <1d231.6af0f3f0.38bd62fc@aol.com> ok, so there's a new e-book viewer-program in town. liza daly, who is affiliated with the o'reilly folks, has brought out "ibisreader" for your reading enjoyment. so i went to check it out... at ibisreader.com, i clicked "get started" and then -- at "add a book" -- "feedbooks: popular public domain". the 4th book down is "alice's adventures in wonderland", and since the movie is coming out this friday, i got that. feedbooks, as you probably know, is a site that takes e-books from various sources, like project gutenberg, and makes some very nice-looking versions of them... since feedbooks re-works the books, they don't have as many books as some of the other sites, but they are preferred by many people because their books look nice. so i start looking at the text, and i find this: > There was nothing so very remarkable in that; > nor did Alice think it so very much out of the way > to hear the Rabbit say to itself ?Oh dear! Oh dear! > I shall be too late!? (when she thought it over > afterwards, it occurred to her that she ought to > have wondered at this, but at the time it all seemed > quite natural); but when the Rabbit actually > took a watch out of its waistcoat-pocket, > and looked at it, and then hurried on, > Alice started to her feet, for it flashed across her mind > that she had never before seen a rabbit with > either a waistcoat-pocket, or a watch to take out of it, > and, burning with curiosity, she ran across the field after it, > and was just in time to see it pop down a large rabbit-hole > under the hedge. well, gee. if you're intimately familiar with this book, you know > took a watch out of its waistcoat-pocket, is a phrase that is _italicized_ in the book. but not on this file... here's where you can see a copy of the original: > http://www.archive.org/stream/alicesadventur00carr#page/2/mode/2up you'll have to take my word for it that it's not italicized in the feedbooks copy that is being used by ibisreader, since i don't see any convenient way for me to link to a specific page there. (but it's book #22 from feedbooks, if you wanna look yourself.) it's pretty clear that what has happened here is that feedbooks has taken pg#11 and used it as its source. what is a sad thing, because -- even after all the years it coulda been "improved" -- pg#11 _still_ doesn't have proper italics in it. it was "updated" in 2005 (leaving no trace behind) and then once again in 2008 (when an .html version was added). but it still has zero italics. > http://www.gutenberg.org/files/11/11-h/11-h.htm > http://www.gutenberg.org/files/11/11.txt the p.g. version _does_ have italics rendered as all-uppercase. so there is _some_ indication of them. but this is ambiguous, since there are places in the book where uppercase is used too: > http://www.archive.org/stream/alicesadventur00carr#page/4/mode/2up (see the reference to "orange marmalade", in all-uppercase.) if i would've been feedbooks, i would've converted _all_ of the uppercased words to italics. but of course then they would've been changing things like chapter headers and first-words too. and of course, feedbooks could've _left_ words in all-uppercase. i don't know why they didn't, but i assume it's because they take _pride_ in their typography, and all-uppercase looks like crap... (but it's true that the tell-tale examples of _actual_ uppercase, namely "orange marmalade", "drink me", and "eat me", are all still rendered in uppercase in feedbooks#22, so i am stumped.) and yes, folks, i know there are other versions of "alice" posted, including some with italics correctly specified. so let us look... pg#19033 _does_ have italics in it, but it does _not_ italicize the phrase about the rabbit, and his watch, and his pocket... whether it's a version-difference -- it _is_ another version -- or whether it's just a digitization mistake, i simply don't know. (and since this is clearly not the explanation for the feedbooks discrepancies, i have no interest in determining the reasons.) pg#928 -- an .html version only -- _does_ have italics. yay! pg#28885 also has the italics, and it has the images as well! although sometimes things don't go right, as shown here: > http://z-m-l.com/misc/alice-glitch.png the feedbooks version has _no_ italics at all, as far as i can see, not even the different set of italics from pg#19033. so, at first, i had thought they'd used pg#11 as their source text, but now, i'm not so sure. (they have some strange contractions, which could indicate that they might've done their own digitization.) at any rate, the point still stands that pg#11 lacks proper italics. all-uppercase is _not_ a substitute. and since the italics _have_ been done -- in pg#928 -- the changes should be incorporated into pg#11. the presence of other versions, at higher numbers, won't change the fact that pg#11 is considered as "the original." so, c'mon, let's see someone who talks about pg/dp "quality" and "incremental improvement" actually _back_up_ the claim. let's get this classic and legendary e-text cleaned up now, ok? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Mar 1 10:55:59 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 1 Mar 2010 13:55:59 EST Subject: [gutvol-d] nathan hale -- 001 Message-ID: <1ec34.2b548214.38bd67bf@aol.com> ok, in discussing one of the books on rfrank's roundless test-site, i remarked that the scans were badly done, uncharacteristic of roger. much more in keeping with his typical level of quality are the scans from a new book on the site, a biography about nathan hale, so i've scraped and remounted those scans on my site, and will do the book. > http://z-m-l.com/go/nhale/nhalep123.html > http://z-m-l.com/go/nhale/nhale.zml as i've been doing lately, the last page of the book shows the changes that i made to the file to clean it up, to show people how simple it is... this book has a lot of correspondence in it (so it has salutations and signatures and stuff like that), and i haven't done such books before, so i'm gonna have to figure out how to handle all of that in z.m.l., which means looking at a bunch of p.g. e-texts to see how people have represented them in the .txt versions up to this point in time... meanwhile, i'm just dealing with them in a presentational manner, either left-justifying or centering or right-justifying, as appropriate. it should be easy to see how i've indicated that by viewing the pages. (i simply used underbars and equal-signs to indicate each of them.) the book still needs some work. but even now, it's a demonstration of how quickly one person can make a book available to the public... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Mon Mar 1 21:19:25 2010 From: jimad at msn.com (Jim Adcock) Date: Mon, 1 Mar 2010 21:19:25 -0800 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <201002260020.25754.donovan@abs.net> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <201002260020.25754.donovan@abs.net> Message-ID: >Those 6200+ works already are available to the public, at minimum in scanned pages form, and most of them with OCR available. The argument that these works are "trapped" is a red herring stemming from frustration over how long it now takes the DP process to produce a "finished" version of the text. Sorry, but this is NOT a "red herring". Looking at DP's own statistics on this subject, the release rate is about 2/3rds the project start rate -- and has been for many years. Why does this matter -- "eventually all projects will get released?" Yes, but by the time "eventually" happens enough more new books will be stuck on queues that it will continue to be true that the release rate is about 2/3rds the project start rate. This means DP is running in a "self similar" mode where effectively 1/3 of all projects that get started DON'T get released. Which means that 1/3 of all volunteer effort is being wasted. One might say "OK, let's just slow down the project start rate." If you do that then P1s do not have interesting projects to work on and they get frustrated and go do something else with their time. But DP NEEDS to have the P1s because DP grows those -- eventually -- to be the P3s and the F2s and the PPs necessary to get the queues unstuck. But the queues can't get unstuck because increasing the start rate to attract the P1s in turn clogs the queues. So again, what is the solution? 1) Increase the number of P3s, F2s, and PPs by reducing the qualifications. Or 2) improve the tools available to P3s, F2s, and PPs to make them more productive. DP can't fix the problem without changing. If you don't understand this, please take a closer look at the plot that DP makes available at: http://www.pgdp.net/c/stats/stats_central.php where you can see that one third of projects created DO NOT get released because they are stuck on queues. As more books get released it is also true that more books get stuck on queues and the ratio remains the same: 1/3 of books DO NOT get released because they are stuck on queues. Which means that 1/3 of volunteer efforts are being wasted by a flawed process. From jimad at msn.com Mon Mar 1 21:41:01 2010 From: jimad at msn.com (James Adcock) Date: Mon, 1 Mar 2010 21:41:01 -0800 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> Message-ID: >> I think we're worried about the fact that the only version available >> is one that you have to BUY that's based on our volunteer labor. >If that's at least an option, why not? Nobody forces you to buy it, >though. Again, I as an unpaid volunteer don't appreciate having my time and effort converted into a for-profit enterprise before my public domain efforts have reached fruition through DP. The end result is that I get turned off of DP and go "solo" instead. When I go "solo" I admittedly create works that are *somewhat* more buggy than DP claims to make. The difference it that my efforts see the light of day this month rather than three and a half years from now. When my NFP volunteer efforts are used poorly then I find somewhere else to volunteer my time and efforts. Why should DP care? Well, which "DP" are we talking about? The DP made up of volunteers who get frustrated by the inefficiencies and leave? Or the DP made up of lifers who don't want to see change? -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Mon Mar 1 21:55:26 2010 From: dakretz at gmail.com (don kretz) Date: Mon, 1 Mar 2010 21:55:26 -0800 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <201002260020.25754.donovan@abs.net> Message-ID: <627d59b81003012155w1f6b5c87n79213695a34a9574@mail.gmail.com> It's worse than that. We all know there is a large invisible queue of projects that aren't being posted at all because of the daunting prospect of possibly never seeing your project complete in your own lifetime. And we keep adding tricky new loops and spins for the benefit of one or another deserving category of workers or project types, making the ability to forecast the schedule for *your* project highly speculative. -------------- next part -------------- An HTML attachment was scrubbed... URL: From walter.van.holst at xs4all.nl Mon Mar 1 22:33:28 2010 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Tue, 02 Mar 2010 07:33:28 +0100 Subject: [gutvol-d] Re: =?utf-8?q?=5BSPAM=5D_Re=3A_Re=3A_the_d=2Ep=2E_opinion_on_=22p?= =?utf-8?q?rerelease=22_of=09e-texts?= In-Reply-To: References: "\"<20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> " " <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> Message-ID: On Mon, 1 Mar 2010 21:41:01 -0800, "James Adcock" wrote: > Again, I as an unpaid volunteer don't appreciate having my time and > effort converted into a for-profit enterprise before my public domain > efforts have reached fruition through DP. The end result is that I get > turned off of DP and go "solo" instead. When I go "solo" I admittedly > create works that are *SOMEWHAT* more buggy than DP claims to make. The > difference it that my efforts see the light of day this month rather than > three and a half years from now. When my NFP volunteer efforts are used > poorly then I find somewhere else to volunteer my time and efforts. Why > should DP care? Well, which "DP" are we talking about? The DP made up of > volunteers who get frustrated by the inefficiencies and leave? Or the DP > made up of lifers who don't want to see change? The end result will still be in the public domain and can be scooped up by any entity, commercial or non-commercial. I don't really see the point you are trying to make. Regards, Walter From sankarrukku at gmail.com Mon Mar 1 22:50:27 2010 From: sankarrukku at gmail.com (Sankar Viswanathan) Date: Tue, 2 Mar 2010 12:20:27 +0530 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> Message-ID: Why > should DP care? Well, which "DP" are we talking about? The DP made up of > volunteers who get frustrated by the inefficiencies and leave? Or the DP > made up of lifers who don't want to see change? The above two categories form a very small percentage of D.P volunteers. The vast majority (who are silent) are continuing to work in D.P. They are aware of the problems and hope that solutions would be found shortly. They are convinced that the D.P Board would implement changes for effecting a better flow of the books. -- Sankar Service to Humanity is Service to God -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Tue Mar 2 08:37:42 2010 From: jimad at msn.com (Jim Adcock) Date: Tue, 2 Mar 2010 08:37:42 -0800 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: "\"<20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> " " <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> Message-ID: >The end result will still be in the public domain and can be scooped up by any entity, commercial or non-commercial. I don't really see the point you are trying to make. The "end result" to date is that a commercial company has taken my not-for-profit work off DP at SR time and redistributed it under DRM such that it cannot to date be "scooped up" by any other entity, commercial or non-commercial. The "end result" to date is that the donation of my time and effort to a non-profit activity has been privatized for other's profit without any contribution to the non-profit community. This is typically called "conversion" and is typically considered at least morally to be theft of non-profit contributions. If I wanted to work for profit I would do so in the first place -- and would do so for my own profit rather that of bottom feeders who prey on DP. Again, if "DP" [whoever that is] doesn't care about these issues, *I DO*, and so I will put my volunteer efforts elsewhere -- where my volunteer efforts WILL go in fact into NFP, and where my volunteer efforts WILL make a positive impact on the world in a finite amount of time. From grythumn at gmail.com Tue Mar 2 08:59:36 2010 From: grythumn at gmail.com (Robert Cicconetti) Date: Tue, 2 Mar 2010 11:59:36 -0500 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> Message-ID: <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> On Tue, Mar 2, 2010 at 11:37 AM, Jim Adcock wrote: > The "end result" to date is that a commercial company has taken my > not-for-profit work off DP at SR time and redistributed it under DRM such > that it cannot to date be "scooped up" by any other entity, commercial or > non-commercial. ?The "end result" to date is that the donation of my time > and effort to a non-profit activity has been privatized for other's profit > without any contribution to the non-profit community. ?This is typically > called "conversion" and is typically considered at least morally to be theft > of non-profit contributions. ?If I wanted to work for profit I would do so > in the first place -- and would do so for my own profit rather that of > bottom feeders who prey on DP. Again, if "DP" [whoever that is] doesn't care > about these issues, *I DO*, and so I will put my volunteer efforts elsewhere > -- where my volunteer efforts WILL go in fact into NFP, and where my > volunteer efforts WILL make a positive impact on the world in a finite > amount of time. I'm not sure if you understand what "Public Domain" means. It is not not-for-profit... it means there is _no_ restriction on further use of the text. Someone can reprint it, use it for derivative works, fold, spindle, mutilate, write slash, whatever, at any point[0]. There is no copyright restriction attached, and *no legal way to prevent redistribution*[1]. It also works the other way... the independent commercial entity that republished the text on Amazon has no way to prevent us from putting the final, polished text up *for free* at PG once it finishes PP/PPV. Also, it can indeed be "scooped up" by anyone else who wishes to at DP before that point. DP, the organization, is a not-for-profit. The material that the organization works upon are Public Domain in the US. R C [0] Technically there is an automatic copyright on the annotations that the proofers insert... they'd have to strip the [**] notes. [1] Trademarks can turn up in specific cases, but that's another issue entirely. From jimad at msn.com Tue Mar 2 09:43:34 2010 From: jimad at msn.com (Jim Adcock) Date: Tue, 2 Mar 2010 09:43:34 -0800 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> Message-ID: >I'm not sure if you understand what "Public Domain" means. I certainly understand what it means. I volunteer my not-for-profit efforts to make public domain works. Those works IN PRACTICE enter the public domain when PG makes them available to the public, not before then. When books get stuck on DP queues "forever" then for-profits pick them up from SR and distribute them under DRM at which point in time the book still IN PRACTICE fails to enter the public domain. This makes me unhappy, not principally because a for-profit has picked up the book but rather because DP continues to fail to recognize that their current queuing system and work rules are busted, such that effectively one third of the effort contributed to DP never in practice reaches the public domain, which in turn wastes my time and effort when I volunteer there -- not to mention more importantly the time and effort of 1000's of others who volunteer there. But, instead of recognizing that the current system is busted and that people there need to fix it what happens instead is that DP'ers insult the intelligence of people who try to point out to them that the current system is in fact busted. Again, under the current DP system for every three books started two books get released. This means that about 1/3 of the DP volunteers efforts are effectively being wasted. From klofstrom at gmail.com Tue Mar 2 10:16:27 2010 From: klofstrom at gmail.com (Karen Lofstrom) Date: Tue, 2 Mar 2010 08:16:27 -1000 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> Message-ID: <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> On Tue, Mar 2, 2010 at 7:43 AM, Jim Adcock wrote: > But, instead of recognizing that the current system is busted and that people there need to fix it what happens instead is that DP'ers insult the intelligence of people who try to point out to them that the current system is in fact busted. Jim, we've known that it's busted for quite some time. You don't need to scream at us and tell us we're idiots and fools if we don't do what YOU order us to do, immediately. The negative reaction you're getting is to your tone and tactics, not your news flash. The problem is knowing just how the fix the beast while it's careering along -- like fixing your car while it's in motion. Because I'm not a programmer, I can't contribute to the solution, but I have high hopes that someone will code a system that can be shown (by experiment, in practice) to work better. Once there's a working prototype, you'll see movement. -- Karen Lofstrom aka Zora From grythumn at gmail.com Tue Mar 2 10:29:41 2010 From: grythumn at gmail.com (Robert Cicconetti) Date: Tue, 2 Mar 2010 13:29:41 -0500 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> Message-ID: <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> On Tue, Mar 2, 2010 at 12:43 PM, Jim Adcock wrote: >>I'm not sure if you understand what "Public Domain" means. > I certainly understand what it means. ?I volunteer my not-for-profit efforts > to make public domain works. ?Those works IN PRACTICE enter the public > domain when PG makes them available to the public, not before then. When > books get stuck on DP queues "forever" then for-profits pick them up from SR > and distribute them under DRM at which point in time the book still IN > PRACTICE fails to enter the public domain. ?This makes me unhappy, not > principally because a for-profit has picked up the book but rather because > DP continues to fail to recognize that their current queuing system and work > rules are busted, such that effectively one third of the effort contributed > to DP never in practice reaches the public domain, which in turn wastes my > time and effort when I volunteer there -- not to mention more importantly > the time and effort of 1000's of others who volunteer there. But, instead of > recognizing that the current system is busted and that people there need to > fix it what happens instead is that DP'ers insult the intelligence of people > who try to point out to them that the current system is in fact busted. > Again, under the current DP system for every three books started two books > get released. ?This means that about 1/3 of the DP volunteers efforts are > effectively being wasted. Copyright works have to be in the public domain before any at DP touches it. It's still in the public domain while at DP, and it is in the public domain when it leaves DP for PG. We can try[1] to restrict access to intermediate stages by technical means, but we do NOT have any legal means to prevent redistribution short of trying something with contract law (a EULA or such).[2] You also seem to believe there is a black hole at DP where 1 out of 3 books fall into, never to emerge. This is a patent fallacy. Some books DO get shortstopped in the middle of the process (for missing pages and other issues) but it is nowhere near 1 in 3 and there is significant effort (the project hospital) to push these back into the active process. The closest thing to a black hole is PP: Available, where books can indeed sit indefinitely... but most don't. I'm not going to argue this any further with you, though. People have long been aware of the problem, and it is clear that nothing I say will influence you. R C [1] It would be a bad idea IMO, but it has been tried in the past. [2] Which would be both impractical, and against the principles of trying to get public domain works accessible, again IMO. From marcello at perathoner.de Tue Mar 2 11:16:13 2010 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 02 Mar 2010 20:16:13 +0100 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> Message-ID: <4B8D63FD.5020102@perathoner.de> Robert Cicconetti wrote: > Copyright works have to be in the public domain before any at DP > touches it. It's still in the public domain while at DP, and it is in > the public domain when it leaves DP for PG. We can try[1] to restrict > access to intermediate stages by technical means, but we do NOT have > any legal means to prevent redistribution short of trying something > with contract law (a EULA or such).[2] What??? Are you saying everybody can steal everybody's else's files if they contain only PD material? If you *publish* PD material, everybody can take it and re-use it as they see fit. To publish something means to make it available to everybody. If you keep PD material on a workgroup server which is not accessible to the public at large and somebody grabs this material without your permission, then the material is *stolen* and you can prosecute them. (Provided you can prove that it was indeed your file, which should not be difficult because the scanno pattern is practically a watermark.) -- Marcello Perathoner webmaster at gutenberg.org From grythumn at gmail.com Tue Mar 2 11:31:10 2010 From: grythumn at gmail.com (Robert Cicconetti) Date: Tue, 2 Mar 2010 14:31:10 -0500 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <4B8D63FD.5020102@perathoner.de> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> <4B8D63FD.5020102@perathoner.de> Message-ID: <15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com> On Tue, Mar 2, 2010 at 2:16 PM, Marcello Perathoner wrote: > Robert Cicconetti wrote: > What??? > > Are you saying everybody can steal everybody's else's files if they contain > only PD material? > > If you *publish* PD material, everybody can take it and re-use it as they > see fit. To publish something means to make it available to everybody. > > If you keep PD material on a workgroup server which is not accessible to the > public at large and somebody grabs this material without your permission, > then the material is *stolen* and you can prosecute them. (Provided you can > prove that it was indeed your file, which should not be difficult because > the scanno pattern is practically a watermark.) We're not talking about computer trespassing; the discussion is in regards to publicly available public domain material, not locked up on someone's personal computer or server. PG has procedures for establishing whether a random etext found online is public domain work, and allowing people to republish it at PG. http://www.gutenberg.org/wiki/Gutenberg:Copyright_Confirmation_How-To Random scannos do not establish a new copyrightable work, nor does sweat-of-brow. (Under current US law, etc etc.) R C From dakretz at gmail.com Tue Mar 2 11:51:44 2010 From: dakretz at gmail.com (don kretz) Date: Tue, 2 Mar 2010 11:51:44 -0800 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> Message-ID: <627d59b81003021151m60c0b2d3ue0a5f18981d22a69@mail.gmail.com> The queues also seem to have the effect of promoting the release of short, easier projects at the expense of longer, more challenging ones. Consequently some of the more significant works are delayed. In June of 2005, the nine volumes of The Works of William Shakespeare - Cambridge Editionwere submitted. This was before the queues era, and the records aren't clear, but the first volume (processed as 6 separate projects, 1 play per project) were completed and became available by the end of 2006. Volumes 2 to 8 are sitting in the F2 queue, waiting to be released so they can be formatted as the last step before post-processing and eventual submission to PG. The first of them has yet to make its way completely through since the introduction of queueing. (I can't tell where Volume 9 is - it may not have been submitted yet.) -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Tue Mar 2 12:02:47 2010 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 02 Mar 2010 21:02:47 +0100 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> <4B8D63FD.5020102@perathoner.de> <15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com> Message-ID: <4B8D6EE7.5070409@perathoner.de> Robert Cicconetti wrote: > On Tue, Mar 2, 2010 at 2:16 PM, Marcello Perathoner > wrote: >> Robert Cicconetti wrote: >> What??? >> >> Are you saying everybody can steal everybody's else's files if they contain >> only PD material? >> >> If you *publish* PD material, everybody can take it and re-use it as they >> see fit. To publish something means to make it available to everybody. >> >> If you keep PD material on a workgroup server which is not accessible to the >> public at large and somebody grabs this material without your permission, >> then the material is *stolen* and you can prosecute them. (Provided you can >> prove that it was indeed your file, which should not be difficult because >> the scanno pattern is practically a watermark.) > > We're not talking about computer trespassing; the discussion is in > regards to publicly available public domain material, not locked up on > someone's personal computer or server. We are talking about files that are sitting in some queue on a DP server. The DP server is not publicly accessible: It asks for a password. Taking a file out of a password-protected site and making it public without the site owner's permission is illegal. It is irrelevant if the file contains PD material or not. Try an art collector's home and explain to him that you have a *right* to enter and photograph his Monet because it happens to be in the public domain... -- Marcello Perathoner webmaster at gutenberg.org From greg at durendal.org Tue Mar 2 12:01:14 2010 From: greg at durendal.org (Greg Weeks) Date: Tue, 2 Mar 2010 15:01:14 -0500 (EST) Subject: [gutvol-d] [SPAM] Re: Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> Message-ID: On Tue, 2 Mar 2010, Robert Cicconetti wrote: > redistribution*[1]. It also works the other way... the independent > commercial entity that republished the text on Amazon has no way to > prevent us from putting the final, polished text up *for free* at PG > once it finishes PP/PPV. Also, it can indeed be "scooped up" by anyone > else who wishes to at DP before that point. Well no it can't. Mostly they put DRM on it, so it's a felony in the US to do anything with it. Now if someone like manybooks gets it I don't care. -- Greg Weeks http://durendal.org:8080/greg/ From grythumn at gmail.com Tue Mar 2 12:07:48 2010 From: grythumn at gmail.com (Robert Cicconetti) Date: Tue, 2 Mar 2010 15:07:48 -0500 Subject: [gutvol-d] Re: [SPAM] Re: Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> Message-ID: <15cfa2a51003021207v51a17093g11b62a73bd91df2e@mail.gmail.com> On Tue, Mar 2, 2010 at 3:01 PM, Greg Weeks wrote: > On Tue, 2 Mar 2010, Robert Cicconetti wrote: > >> redistribution*[1]. It also works the other way... the independent >> commercial entity that republished the text on Amazon has no way to >> prevent us from putting the final, polished text up *for free* at PG >> once it finishes PP/PPV. Also, it can indeed be "scooped up" by anyone >> else who wishes to at DP before that point. > > Well no it can't. Mostly they put DRM on it, so it's a felony in the US to > do anything with it. Now if someone like manybooks gets it I don't care. "Also, it can indeed be "scooped up" by anyone else who wishes to at DP before that point." Note I said it is accessible at DP, not suggesting that one break DRM. -Bob From greg at durendal.org Tue Mar 2 12:04:35 2010 From: greg at durendal.org (Greg Weeks) Date: Tue, 2 Mar 2010 15:04:35 -0500 (EST) Subject: [gutvol-d] [SPAM] Re: Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <4B8D6EE7.5070409@perathoner.de> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> <4B8D63FD.5020102@perathoner.de> <15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com> <4B8D6EE7.5070409@perathoner.de> Message-ID: On Tue, 2 Mar 2010, Marcello Perathoner wrote: > We are talking about files that are sitting in some queue on a DP server. The > DP server is not publicly accessible: It asks for a password. Taking a file > out of a password-protected site and making it public without the site > owner's permission is illegal. It is irrelevant if the file contains PD > material or not. I suspect that wouldn't fly in the US. There's no restriction on getting an account, so it's likely there was no trespass. Maybe a TOS violation, but I don't think there's anything preventing this in the DP TOS, and I don't think there should be in general. Even if it does sometimes irritate me. -- Greg Weeks http://durendal.org:8080/greg/ From marcello at perathoner.de Tue Mar 2 12:29:08 2010 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 02 Mar 2010 21:29:08 +0100 Subject: [gutvol-d] Re: [SPAM] Re: Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> <4B8D63FD.5020102@perathoner.de> <15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com> <4B8D6EE7.5070409@perathoner.de> Message-ID: <4B8D7514.6080705@perathoner.de> Greg Weeks wrote: > On Tue, 2 Mar 2010, Marcello Perathoner wrote: > >> We are talking about files that are sitting in some queue on a DP >> server. The DP server is not publicly accessible: It asks for a >> password. Taking a file out of a password-protected site and making it >> public without the site owner's permission is illegal. It is >> irrelevant if the file contains PD material or not. > > I suspect that wouldn't fly in the US. There's no restriction on getting > an account, so it's likely there was no trespass. Maybe a TOS violation, > but I don't think there's anything preventing this in the DP TOS, and I > don't think there should be in general. Even if it does sometimes > irritate me. That would very well fly. I don't believe the DP TOS allow you to take a file out and publish it on your own. And if they allow that, I don't understand all the fuss they are making against a PG preprint distribution. Oh, and all those signs that say you can't take any pictures in US. museums, don't they fly? -- Marcello Perathoner webmaster at gutenberg.org From greg at durendal.org Tue Mar 2 12:36:19 2010 From: greg at durendal.org (Greg Weeks) Date: Tue, 2 Mar 2010 15:36:19 -0500 (EST) Subject: [gutvol-d] [SPAM] Re: Re: [SPAM] Re: Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <4B8D7514.6080705@perathoner.de> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> <4B8D63FD.5020102@perathoner.de> <15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com> <4B8D6EE7.5070409@perathoner.de> <4B8D7514.6080705@perathoner.de> Message-ID: On Tue, 2 Mar 2010, Marcello Perathoner wrote: > Greg Weeks wrote: >> On Tue, 2 Mar 2010, Marcello Perathoner wrote: >> >>> We are talking about files that are sitting in some queue on a DP server. >>> The DP server is not publicly accessible: It asks for a password. Taking a >>> file out of a password-protected site and making it public without the >>> site owner's permission is illegal. It is irrelevant if the file contains >>> PD material or not. >> >> I suspect that wouldn't fly in the US. There's no restriction on getting an >> account, so it's likely there was no trespass. Maybe a TOS violation, but I >> don't think there's anything preventing this in the DP TOS, and I don't >> think there should be in general. Even if it does sometimes irritate me. > > That would very well fly. I don't believe the DP TOS allow you to take a file > out and publish it on your own. And if they allow that, I don't understand > all the fuss they are making against a PG preprint distribution. It's generally been admitted that they can't stop it. It's if it should be officially sanctioned or not. > Oh, and all those signs that say you can't take any pictures in US. museums, > don't they fly? Only to the extent that if they ask you to leave and if you don't comply you are trespassing. They cannot make you delete any pictures you've taken. They can't stop you from doing anything with the picture you want if the art doesn't currently have a copyright. -- Greg Weeks http://durendal.org:8080/greg/ From grythumn at gmail.com Tue Mar 2 12:51:18 2010 From: grythumn at gmail.com (Robert Cicconetti) Date: Tue, 2 Mar 2010 15:51:18 -0500 Subject: [gutvol-d] Re: [SPAM] Re: Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <4B8D7514.6080705@perathoner.de> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> <4B8D63FD.5020102@perathoner.de> <15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com> <4B8D6EE7.5070409@perathoner.de> <4B8D7514.6080705@perathoner.de> Message-ID: <15cfa2a51003021251g75da7277g975f569048bf06c1@mail.gmail.com> On Tue, Mar 2, 2010 at 3:29 PM, Marcello Perathoner wrote: > That would very well fly. I don't believe the DP TOS allow you to take a > file out and publish it on your own. And if they allow that, I don't > understand all the fuss they are making against a PG preprint distribution. The difference is between something that is tolerated, and an officially sanctioned central repository. Also, I think the arguments for posting text and HTML separately got confused with the arguments about posting earlier in the process. phpBB's threading is... suboptimal. Personally, I'm in the pre-publish camp (after it passes each round, by preference. There's little point in splitting TXT and HTML posting at PP). As well as making p1->p1 opt-out, p3 opt in, parallel f1 opt-out[1], and f2 opt in. R C [1] Means a little more work for the PM to do the merge, but worth it IMO for simpler works. Would need some relatively minor tool or dev support. From Bowerbird at aol.com Tue Mar 2 14:12:50 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 2 Mar 2010 17:12:50 EST Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts Message-ID: <77bdd.26df5c3.38bee762@aol.com> bob said, to jim- > I'm not going to argue this any further with you, though. truth be told, you've haven't provided any argumentation anyway. you ignored jim's main point, to argue some legalistic crap which jim knows quite well and was never in dispute. indeed, it is precisely the troubling fact that material which _is_ "in the public domain" in a _legal_ sense, but is only _available_ for sale, because d.p. can't get it out the door, that's the point... and if you have nothing to say in regard to that point, then it's probably a good thing that you stop posting replies of any type. (except that you've illustrated jim's point about d.p. apologists.) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From gbnewby at pglaf.org Tue Mar 2 14:16:37 2010 From: gbnewby at pglaf.org (Greg Newby) Date: Tue, 2 Mar 2010 14:16:37 -0800 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <4B8D63FD.5020102@perathoner.de> References: <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> <4B8D63FD.5020102@perathoner.de> Message-ID: <20100302221637.GA27060@pglaf.org> On Tue, Mar 02, 2010 at 08:16:13PM +0100, Marcello Perathoner wrote: > Robert Cicconetti wrote: > > >Copyright works have to be in the public domain before any at DP > >touches it. It's still in the public domain while at DP, and it is in > >the public domain when it leaves DP for PG. We can try[1] to restrict > >access to intermediate stages by technical means, but we do NOT have > >any legal means to prevent redistribution short of trying something > >with contract law (a EULA or such).[2] > > What??? > > Are you saying everybody can steal everybody's else's files if they > contain only PD material? > > If you *publish* PD material, everybody can take it and re-use it as > they see fit. To publish something means to make it available to > everybody. > > If you keep PD material on a workgroup server which is not > accessible to the public at large and somebody grabs this material > without your permission, then the material is *stolen* and you can > prosecute them. (Provided you can prove that it was indeed your > file, which should not be difficult because the scanno pattern is > practically a watermark.) These don't seem like strongly conflicting statements. Our "no sweat of the brow how-to" gives a similar view. IF someone were to gain illicit access to files at DP or elsewhere, regardless of whether they were public domain, various legal remedies could be applied. (Quite a few, and most countries have their own set of remedies ranging from contracts, to EULAs, to things like computer fraud & abuse or misappropriation of resources.) But as Robert mentioned, that doesn't change that the public domain content is still public domain...no matter how much value has been added through scanning, OCR, proofreading, etc. What happens if such content mysterioulsy, untraceably extracts itself from DP and becomes available elsewhere? Well, it's still public domain. (Bonus reading assignment: Steven Levy's "Crypto," which describes how the PGP software, which was ineligible for export from the US, found its way into other countries -- where it was perfectly legal to use.) -- Greg PS: Over the years, I've been involved in various efforts to bring legal remedies to online incidents. It is very hard to do, especially when there is little or no money involved. Doubly-especially if any of the actors are in different countries. Robert's emphasis on technical measures, versus more legalistic ones, is more likely to give satisfaction. From jimad at msn.com Tue Mar 2 14:41:56 2010 From: jimad at msn.com (James Adcock) Date: Tue, 2 Mar 2010 14:41:56 -0800 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> Message-ID: >The negative reaction you're getting is to your tone and tactics, not your news flash. Sorry, but *my* negative reactions are based on DP people who say: a) That there is no problem having books stuck on queues for an average of 3.5 years now. And/or b) Offer "solutions" which will not in fact reduce the size of the queues and how long books sit there. Again: a) There IS a problem with having books stuck on queues, including that fact that 1/3 of the volunteers' time and energy is being wasted currently. b) Any proposed "solution" has to in fact act to reduce the size of the queues and how long books sit there. And it needs to do so without chasing away any class of volunteers including P1s -- since P1s represent the future of DP. One simple suggestion to start with would be to start by changing the stated "Goals" for P3 and F2 and PP to be larger than the Goals for P2 and F1. To do otherwise is to have DP suggesting that they want the queues to be even longer than they are now. Right now the stated goals for P2 and F1 are larger than the stated goals for P3 and PP -- which will only make the queuing situation worse. The fact that the "Goals" are inverted would seem to imply that the powers that be do not understand the nature of the problem -- in which case how can they fix it? From dakretz at gmail.com Tue Mar 2 14:42:55 2010 From: dakretz at gmail.com (don kretz) Date: Tue, 2 Mar 2010 14:42:55 -0800 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <20100302221637.GA27060@pglaf.org> References: <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> <4B8D63FD.5020102@perathoner.de> <20100302221637.GA27060@pglaf.org> Message-ID: <627d59b81003021442j1ff97b5bk4b0dc96604c02c22@mail.gmail.com> And what's the message that we send when we use someone else's work (the book) that someone else scans, and someone else collects, posts, and manages (TIA) and a bunch of other people proof and/or format, and then keep that accumulated and integrated value that's been generously and freely provided for us to usel locked away exclusively for several years for one Post Processor to work on, when they get around to it? On Tue, Mar 2, 2010 at 2:16 PM, Greg Newby wrote: > On Tue, Mar 02, 2010 at 08:16:13PM +0100, Marcello Perathoner wrote: > > Robert Cicconetti wrote: > > > > >Copyright works have to be in the public domain before any at DP > > >touches it. It's still in the public domain while at DP, and it is in > > >the public domain when it leaves DP for PG. We can try[1] to restrict > > >access to intermediate stages by technical means, but we do NOT have > > >any legal means to prevent redistribution short of trying something > > >with contract law (a EULA or such).[2] > > > > What??? > > > > Are you saying everybody can steal everybody's else's files if they > > contain only PD material? > > > > If you *publish* PD material, everybody can take it and re-use it as > > they see fit. To publish something means to make it available to > > everybody. > > > > If you keep PD material on a workgroup server which is not > > accessible to the public at large and somebody grabs this material > > without your permission, then the material is *stolen* and you can > > prosecute them. (Provided you can prove that it was indeed your > > file, which should not be difficult because the scanno pattern is > > practically a watermark.) > > These don't seem like strongly conflicting statements. Our "no sweat of > the brow how-to" gives a similar view. > > IF someone were to gain illicit access to files at DP or elsewhere, > regardless of whether they were public domain, various legal remedies > could be applied. (Quite a few, and most countries have their own set > of remedies ranging from contracts, to EULAs, to things like computer > fraud & abuse or misappropriation of resources.) > > But as Robert mentioned, that doesn't change that the public domain > content is still public domain...no matter how much value has been added > through scanning, OCR, proofreading, etc. What happens if such content > mysterioulsy, untraceably extracts itself from DP and becomes available > elsewhere? Well, it's still public domain. > > (Bonus reading assignment: Steven Levy's "Crypto," which describes how > the PGP software, which was ineligible for export from the US, found its > way into other countries -- where it was perfectly legal to use.) > > -- Greg > > PS: Over the years, I've been involved in various efforts to bring > legal remedies to online incidents. It is very hard to do, especially > when there is little or no money involved. Doubly-especially if any > of the actors are in different countries. Robert's emphasis on technical > measures, versus more legalistic ones, is more likely to give satisfaction. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Tue Mar 2 15:10:54 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 2 Mar 2010 18:10:54 EST Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts Message-ID: <9889.63fb9605.38bef4fe@aol.com> jim quoted someone as saying: > The negative reaction you're getting > is to your tone and tactics, not your news flash. now, i didn't see this quote when it came up originally, meaning that it probably came from someone who is in my spam folder, which would mean marcello or zora. i'm betting it's zora, the supreme apologist here for d.p. this is the kind of "blame-the-messenger" crap that they _love_ to do over at d.p. they can't argue with the message, so they talk about your "tone" instead, when it's their own damn fault that you had to adopt that tone in the first place, because they're bound and determined to ignore you totally. and that's because they are incapable of solving anything. and it's interesting to see _why_ d.p. can't solve anything, as the d.p. people here -- right on up to board member newby -- are unable to avoid dragging a thread off-topic. (although we must give marcello credit for a serious detour, by raising a phantom that files are being _stolen_ from d.p.) i mean, seriously, you want to witness something _amazing_, just take a look at recent posts in this thread, were _jim_ is the one who manages to (a) stay on topic! and (b) make sense. jim!, for crying out loud, the same jim who often has difficulty arguing his way out of a wet paper bag, and he's the one here who is doing the _best_, absolutely outshining all of the rest! so, of course, let's attack jim, and his "tone and tactics"... let me break it down to a nutshell... d.p. has thousands of proofers doing p1, the first proof pass. d.p. has hundreds of proofers in p2, the more-careful pass. d.p. has dozens of proofers for p3, the "final final pass" pass. i don't know about you, but i'd expect that a "final pass" will take a closer reading (and thus more time) than a first-pass, but assume p1 and p2 and p3 proofers all take the same time. somehow, however, the fundamental workflow at d.p. expects dozens of p3 people to keep up with thousands of p1 people... this is ridiculous on the face of it. and this is the main problem. or perhaps the _main_ problem is that the d.p. "powers that be" could seriously install such a ridiculous-on-its-face workflow... whichever way we look at it, it's purely and absolutely ludicrous. (quick, somebody please give me more synonyms for "ridiculous.") -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Tue Mar 2 15:26:12 2010 From: jimad at msn.com (James Adcock) Date: Tue, 2 Mar 2010 15:26:12 -0800 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> Message-ID: >You also seem to believe there is a black hole at DP where 1 out of 3 books fall into, never to emerge. This is a patent fallacy. The fallacy is in assuming that the only way DP can waste volunteer efforts is to never ship some particular book. On the contrary large and increasing queue sizes can as effectively waste volunteer efforts as much as never shipping some particular book. Again, consider the Russian Roulette test: DP managers randomly shoot 1/3 of the projects at DP (prior to PP). How do these murders affect the shipping rate out of DP? Answer: They don't change the shipping rate out of DP. Conclusion: If you can destroy 1/3 of the projects at DP without affecting the productivity rate out of DP then 1/3 of the productivity at DP is being wasted. How is that productivity being wasted? By sticking it on large and increasing queues. Consider a factory that only ships 2/3rds of what it ever starts to make. Does the unfinished inventory represent value or not? Well, the factory only *realizes* value by shipping product. The shipped product has value, and eventually every piece of product gets shipped, but as long as the factory only ships 2/3rds of everything it ever makes the fact remains that the cost of manufacturing is 50% higher than it need be. IE the factory is only running at 2/3rds of its potential productivity. That unfinished inventory *might* be considered to have value, but only if new owners buy out the old owners, and change the manufacturing process such that you don't have unshipped inventory plugging up the factory anymore. Or if buyers get tired of paying 50% more for products than they should be and stop buying, then the factory has an opportunity to work off that unfinished inventory, realizing its value -- assuming they can lure back buyers at the new now lower price that doesn't include the wasted 50% markup for product started but not yet shipped. In the DP case what this analogy means is that DP gets a chance to work off the inventory if and when P1s get tired of DP wasting their time and energy and thus stop putting new work into the head of the DP queue. But DP needs P1s since they represent the future of DP. Now how can it be that a factory only ships 2/3rds of what it makes but at the same time it eventually ships every item? Consider for simplicity that the factory makes rolls of toilet paper and ships those rolls out to customers based on a "First In First Out" FIFO toilet paper roll queuing system. Does every roll of toilet paper eventually get shipped? Yes. But the problem is is that the queues are constantly getting larger, and as they do so they consume 1/3rd of the factory's resources. Consider if we changed to a "Last In First Out" queuing system. Does that change the nature of the problem? NO -- a roll of toilet paper is a roll of toilet paper. But now, based on LIFO it becomes obvious that some rolls of paper never do get shipped -- the 1/3rd of the older toilet paper rolls at any given time never get shipped -- 1/3 of all toilet paper rolls every made, and the situation keeps getting worse. But the choice of FIFO vs. LIFO queuing system in no way changes the nature of the problem -- a toilet paper roll is a toilet paper roll. Thus, on the contrary to the previously stated hypothesis, it is NOT necessary to have a "black hole" in order to waste time and effort. All that is necessary is to have a large and increasing queuing system -- whether that queuing system is LIFO or FIFO. Or stated another way, large queuing systems ARE the black hole. The mere fact that any given book eventually makes it out of the queue is not sufficient to keep the large queuing systems from being a black hole -- as long as the black hole continues to suck in more than it spits out. From ke at gnu.franken.de Tue Mar 2 15:27:57 2010 From: ke at gnu.franken.de (Karl Eichwalder) Date: Wed, 03 Mar 2010 00:27:57 +0100 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: (James Adcock's message of "Tue, 2 Mar 2010 14:41:56 -0800") References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> Message-ID: "James Adcock" writes: > a) That there is no problem having books stuck on queues for an average of > 3.5 years now. It's a storage "problem"--nothing more, nothing less. There are books waiting in the google cache for more than x years. Not to mention all the libraries... The problem is you and me, who don't want to understand that is impossible to read all the books in livetime. -- Karl Eichwalder From jimad at msn.com Tue Mar 2 15:39:18 2010 From: jimad at msn.com (James Adcock) Date: Tue, 2 Mar 2010 15:39:18 -0800 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> Message-ID: >The problem is you and me, who don't want to understand that is impossible to read all the books in livetime. By the same argument volunteers should stop working on DP because there are more books at PG than can be read in a lifetime... ...In fact there are more books stuck on the queues at DP than can be read in a lifetime.... From pterandon at gmail.com Tue Mar 2 15:56:13 2010 From: pterandon at gmail.com (Greg M. Johnson) Date: Tue, 2 Mar 2010 18:56:13 -0500 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts Message-ID: From: don kretz > And what's the message that we send when we use someone > else's work (the book) that someone else scans, and > someone else collects, posts, and manages (TIA) and a > bunch of other people proof and/or format, and then keep > that accumulated and integrated value that's been generously > and freely provided for us to use locked away exclusively > for several years for one Post Processor to work on, > when they get around to it? Is something else happening to the work during this time-- like papers, etc., being written on it? -- Greg M. Johnson http://pterandon.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Tue Mar 2 16:49:19 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 2 Mar 2010 19:49:19 EST Subject: [gutvol-d] Re: Processing eTexts Message-ID: <10608.54ae9312.38bf0c0f@aol.com> so, carel, i hope i didn't scare you away... ;+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcello at perathoner.de Tue Mar 2 23:04:32 2010 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 03 Mar 2010 08:04:32 +0100 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <20100302221637.GA27060@pglaf.org> References: <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> <4B8D63FD.5020102@perathoner.de> <20100302221637.GA27060@pglaf.org> Message-ID: <4B8E0A00.7070502@perathoner.de> Greg Newby wrote: > But as Robert mentioned, that doesn't change that the public domain > content is still public domain...no matter how much value has been added > through scanning, OCR, proofreading, etc. What happens if such content > mysterioulsy, untraceably extracts itself from DP and becomes available > elsewhere? Well, it's still public domain. But you would sue them for trespass, not for copyright infringement. > PS: Over the years, I've been involved in various efforts to bring > legal remedies to online incidents. It is very hard to do, especially > when there is little or no money involved. Doubly-especially if any > of the actors are in different countries. Robert's emphasis on technical > measures, versus more legalistic ones, is more likely to give satisfaction. Amazon would be an US company though. And sueing Amazon would bring some interesting facts to the public attention as to the provenience of some material they DRM. -- Marcello Perathoner webmaster at gutenberg.org From richfield at telkomsa.net Tue Mar 2 23:22:24 2010 From: richfield at telkomsa.net (Jon Richfield) Date: Wed, 03 Mar 2010 09:22:24 +0200 Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land? In-Reply-To: References: <4B76F4C6.3030006@teksavvy.com> <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com> <4B785C40.5000304@xs4all.nl> <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com> <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com> <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com> <9F120957CF48439F9C63FD74DE1B25F7@alp2400> Message-ID: <4B8E0E30.4090905@telkomsa.net> Sorry, I have been out and had email problems etc... I strongly urge you to follow up this line of thought. There are several sites on the internet doing fine work by making valuable material available, much of which is either full of scanning errors or even in scanned form. Is it satisfactory? Certainly not. Is it worth making available against the time that someone else improves it, if ever? MOST certainly. Is it consonant with our dignity to prefer making perfection available? Certainly. Is it consonant with our dignity to sit on material in case bairns and fools think that the job should do itself? Think about it. Make it available first, and let anyone dissatisfied get busy and make it satisfactory. Cheers, Jon > > Let's just forget the whole idea of error free texts. . . . > > Ever since I started Project Gutenberg I've never seen even > one book I read, even most articles and essays, without big > bluders you would think could never be published. > > I would prefer just to get these materials in circulation-- > then worry about approaching perfection along with Xeno. > > Does anybody have a serious objection to putting the 8,000, > or so, books that were listed earlier as being in limbo, in > something like our "PrePrints" section, where we put eBooks > that are admittedly not ready for prime time??? > > Please. . . . > From gbnewby at pglaf.org Tue Mar 2 23:34:54 2010 From: gbnewby at pglaf.org (Greg Newby) Date: Tue, 2 Mar 2010 23:34:54 -0800 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <4B8E0A00.7070502@perathoner.de> References: <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> <4B8D63FD.5020102@perathoner.de> <20100302221637.GA27060@pglaf.org> <4B8E0A00.7070502@perathoner.de> Message-ID: <20100303073454.GA12104@pglaf.org> On Wed, Mar 03, 2010 at 08:04:32AM +0100, Marcello Perathoner wrote: > Greg Newby wrote: > > >But as Robert mentioned, that doesn't change that the public domain > >content is still public domain...no matter how much value has been added > >through scanning, OCR, proofreading, etc. What happens if such content > >mysterioulsy, untraceably extracts itself from DP and becomes available > >elsewhere? Well, it's still public domain. > > But you would sue them for trespass, not for copyright infringement. Right. That was the point I was making. But finding a lawyer to take the case is tough. Getting the case before a judge is tougher. Pursuing yourself (i.e., in small claims court) is possible for people with time on their hands, but it limited in various ways. > >PS: Over the years, I've been involved in various efforts to bring > >legal remedies to online incidents. It is very hard to do, especially > >when there is little or no money involved. Doubly-especially if any > >of the actors are in different countries. Robert's emphasis on technical > >measures, versus more legalistic ones, is more likely to give satisfaction. > > Amazon would be an US company though. And sueing Amazon would bring > some interesting facts to the public attention as to the provenience > of some material they DRM. Amazon is an interesting and somewhat unique example (Google, Apple and Microsoft are also interesting, and unique in their own ways). You are right that PG or DP could sue Amazon. Some days, I think we should (they sell a lot of Project Gutenberg titles - with the "small print" intact, in various illegitimate ways). What we're talking about, though, is intentional tresspass on DP. I would be surprised if Amazon or the other big companies were interested in that. -- Greg From schultzk at uni-trier.de Wed Mar 3 00:46:48 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Wed, 3 Mar 2010 09:46:48 +0100 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <77bdd.26df5c3.38bee762@aol.com> References: <77bdd.26df5c3.38bee762@aol.com> Message-ID: Hold on a sec! Am 02.03.2010 um 23:12 schrieb Bowerbird at aol.com: > bob said, to jim- > > I'm not going to argue this any further with you, though. > > truth be told, you've haven't provided any argumentation anyway. > > you ignored jim's main point, to argue some legalistic crap which > jim knows quite well and was never in dispute. > > indeed, it is precisely the troubling fact that material which _is_ > "in the public domain" in a _legal_ sense, but is only _available_ > for sale, because d.p. can't get it out the door, that's the point... There is a difference between a text being copyright free and in the public domain.. One can put a copyright and have it be still in the public domain. Personally, as I see it PG text are more or less copyright free and in the public domain. I can use the PG text as I wish as long aas I give them credit. Which I would do. Yet, there is actually no practical way of stopping me from taking a PG text removing all hints to it. Reformatting it and publishing it( even in paper form) under copyright and thereby protecting my WORK. Naturally, I would not do this, but others due. Even if someone puts a text up for sale and copyrights it from PG or DP, there is NOTHING THEY could due against PG or DP from publishing their own version! You see, PG/DP is working with the rights of the law as they can prove where their material is coming from an it was obtained legally and they had not infringed on the copyright. As an example NOBODY in the world is going to get a copyright on Shakespeares works, so that somebody else can not produce Shakepeares works on their own!!! So once the original copyright expires that text is a free for all. Nobody, can get a copyright that will stop anybody else from publishing that text. What they can get is protection for their wok and only their work/book/publication. To come back to the point of prereleasing texts. The best way of catching someone is by using texts that are NOT error-free, Since that error just might propagate. One has then a indisputable MARK to identify your work. regards Keith. -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Wed Mar 3 01:01:52 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Wed, 3 Mar 2010 10:01:52 +0100 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <20100302221637.GA27060@pglaf.org> References: <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> <4B8D63FD.5020102@perathoner.de> <20100302221637.GA27060@pglaf.org> Message-ID: <498D1B7A-6EB0-4872-B654-46F29F072D51@uni-trier.de> Am 02.03.2010 um 23:16 schrieb Greg Newby: > On Tue, Mar 02, 2010 at 08:16:13PM +0100, Marcello Perathoner wrote: > These don't seem like strongly conflicting statements. Our "no sweat of > the brow how-to" gives a similar view. > > IF someone were to gain illicit access to files at DP or elsewhere, > regardless of whether they were public domain, various legal remedies > could be applied. (Quite a few, and most countries have their own set > of remedies ranging from contracts, to EULAs, to things like computer > fraud & abuse or misappropriation of resources.) > > But as Robert mentioned, that doesn't change that the public domain > content is still public domain...no matter how much value has been added > through scanning, OCR, proofreading, etc. What happens if such content > mysterioulsy, untraceably extracts itself from DP and becomes available > elsewhere? Well, it's still public domain. As I have mention in another post. In the public domain and copyrighted are to different animals. I can put source code of a program in the public domain and still maintain a copyright. The same goes for texts. > > (Bonus reading assignment: Steven Levy's "Crypto," which describes how > the PGP software, which was ineligible for export from the US, found its > way into other countries -- where it was perfectly legal to use.) > > -- Greg > > PS: Over the years, I've been involved in various efforts to bring > legal remedies to online incidents. It is very hard to do, especially > when there is little or no money involved. Doubly-especially if any > of the actors are in different countries. Robert's emphasis on technical > measures, versus more legalistic ones, is more likely to give satisfaction. Thats what DRM is. Now, how can they be applied to texts. It can only be done in the file itself. The only way to reach this is with special format for the file that can only be read our own tools, and those tools source should not be publicly available. With most readers available one can still extract the text and their by defeating its protection. This has be done with music. Effectively defeating DRM and that is why iTunes music is now DRM free. The saying still goes, if they is a will there is a way. regards Keith. From schultzk at uni-trier.de Wed Mar 3 01:18:19 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Wed, 3 Mar 2010 10:18:19 +0100 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> Message-ID: The way I look at is that its DP ball. Yet, If the queues are so stuck up, then DP has to shift its work force. That is get volunteers trained and motivated so that they can help clear the queues. This is simple economics. No production company can afford to produce parts for a product and not produce the end product. The only way for a company to to survive is to out-source. Which would be prerelease. Naturally, DP is not interested in making money, yet the analogy holds true, for their goals. regards Keith. Am 02.03.2010 um 23:41 schrieb James Adcock: >> The negative reaction you're getting > is to your tone and tactics, not your news flash. > > Sorry, but *my* negative reactions are based on DP people who say: > > a) That there is no problem having books stuck on queues for an average of > 3.5 years now. > > And/or > > b) Offer "solutions" which will not in fact reduce the size of the queues > and how long books sit there. > > Again: > > a) There IS a problem with having books stuck on queues, including that fact > that 1/3 of the volunteers' time and energy is being wasted currently. > > b) Any proposed "solution" has to in fact act to reduce the size of the > queues and how long books sit there. And it needs to do so without chasing > away any class of volunteers including P1s -- since P1s represent the future > of DP. > > One simple suggestion to start with would be to start by changing the stated > "Goals" for P3 and F2 and PP to be larger than the Goals for P2 and F1. To > do otherwise is to have DP suggesting that they want the queues to be even > longer than they are now. Right now the stated goals for P2 and F1 are > larger than the stated goals for P3 and PP -- which will only make the > queuing situation worse. The fact that the "Goals" are inverted would seem > to imply that the powers that be do not understand the nature of the problem > -- in which case how can they fix it? -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Wed Mar 3 01:29:42 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Wed, 3 Mar 2010 10:29:42 +0100 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> Message-ID: A decade or so ago I pulled the whole PG repository via ftp. I have not gotten through it. What a waste of my time??? On the other side, lets just shut everthing down as most of the consumer computers have some way of displaying scans. So we are just wasting everybodies time? regards Keith. Am 03.03.2010 um 00:27 schrieb Karl Eichwalder: > "James Adcock" writes: > >> a) That there is no problem having books stuck on queues for an average of >> 3.5 years now. > > It's a storage "problem"--nothing more, nothing less. There are books > waiting in the google cache for more than x years. Not to mention all > the libraries... > > The problem is you and me, who don't want to understand that is > impossible to read all the books in livetime. From Bowerbird at aol.com Wed Mar 3 01:38:42 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 3 Mar 2010 04:38:42 EST Subject: [gutvol-d] do not listen, pray Message-ID: <30a2.40c90764.38bf8822@aol.com> do not listen to non-lawyers discussing legal matters. do not even listen to lawyers discussing legal matters, not unless you are paying them. do not pay lawyers, if there's any way you can help it. *** do not listen to people who are talking about "theft". or "trespass". or any other stupid crap such as that. this is project gutenberg, where we transcend via gift. *** do not listen to the people who treat d.p. as if it is a factory, where "parts" are assembled into "products". the improper metaphor will only distract from truth. the queues are not the problem, they are an _effect_ of the problem. treating symptoms is bad strategy. the queues cause problems of their own, but _those_ problems are not the cause either; do not forget that. the problem is you cannot expect dozens of people to match the output created by thousands of people. remember what the problem is. treat the problem. *** pray for the volunteers whose time and energy is wasted. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Wed Mar 3 01:40:48 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Wed, 3 Mar 2010 10:40:48 +0100 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <4B8E0A00.7070502@perathoner.de> References: <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com> <4B8D63FD.5020102@perathoner.de> <20100302221637.GA27060@pglaf.org> <4B8E0A00.7070502@perathoner.de> Message-ID: <8AD2D132-1286-45BC-83CD-1281C1A814AD@uni-trier.de> Am 03.03.2010 um 08:04 schrieb Marcello Perathoner: > Greg Newby wrote: > >> But as Robert mentioned, that doesn't change that the public domain >> content is still public domain...no matter how much value has been added >> through scanning, OCR, proofreading, etc. What happens if such content >> mysterioulsy, untraceably extracts itself from DP and becomes available >> elsewhere? Well, it's still public domain. > > But you would sue them for trespass, not for copyright infringement. So how do you prove they did it. You have to prove that they did indeed trespass. Not an easy job to do!!! > >> PS: Over the years, I've been involved in various efforts to bring >> legal remedies to online incidents. It is very hard to do, especially >> when there is little or no money involved. Doubly-especially if any >> of the actors are in different countries. Robert's emphasis on technical >> measures, versus more legalistic ones, is more likely to give satisfaction. > > Amazon would be an US company though. And sueing Amazon would bring some interesting > facts to the public attention as to the provenience of some material they DRM. DRM is not there to protect copyright, but to protect their investment into the work they have done. Besides, in is not that hard to remove DRM, nowadays. regards Keith. From Bowerbird at aol.com Wed Mar 3 02:31:29 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 3 Mar 2010 05:31:29 EST Subject: [gutvol-d] roundlessness -- 009 Message-ID: <41e5.5c7d201.38bf9481@aol.com> in our "glass-is-one-quarter-full" news today, i note that rfrank has this to say about using reg-ex tests on his roundless site: > It seems to be a big win to make REs that usually are used > during post-processing available to users during proofing. now if roger would realize those reg-ex checks would be even _more_ useful if they were done in book-wide preprocessing, we could award him the "glass-is-three-quarters-full" prize... but let's be thankful for the huge progress he's made thus far. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Wed Mar 3 04:50:34 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Wed, 3 Mar 2010 13:50:34 +0100 Subject: [gutvol-d] Re: do not listen, pray In-Reply-To: <30a2.40c90764.38bf8822@aol.com> References: <30a2.40c90764.38bf8822@aol.com> Message-ID: Am 03.03.2010 um 10:38 schrieb Bowerbird at aol.com: > > do not listen to the people who treat d.p. as if it is a > factory, where "parts" are assembled into "products". > the improper metaphor will only distract from truth. Oh, puppi-cock! You do not even know the difference between an analogy and a methaphor! DPs approach is that of an assembly line. Scans of pages are processed, put together, processed further, go through further processes and eventually a final product comes out. > the queues are not the problem, they are an _effect_ > of the problem. treating symptoms is bad strategy. They are part of the system and assembly line! > > the queues cause problems of their own, but _those_ > problems are not the cause either; do not forget that. Especially, if input and output are not balanced. > > the problem is you cannot expect dozens of people > to match the output created by thousands of people. > remember what the problem is. treat the problem. So you suggest slowing down the work of the volunteers, stopping them? Come on you are smarter than that. The queues definately are not the problem. There are just to few handling the output, or input depending from which side you look at them. As you claim there are thousands creating output. That output becomes input. At some stage SOMEONE has to process that output to finalize it. So QED. There need to be more volunteers working in the latter stages of the production. Nice of you to prove my point!! Cheers Keith. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hart at pglaf.org Wed Mar 3 07:41:48 2010 From: hart at pglaf.org (Michael S. Hart) Date: Wed, 3 Mar 2010 07:41:48 -0800 (PST) Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <77bdd.26df5c3.38bee762@aol.com> Message-ID: Keith Schultz asked why text and not scans? Here are the most obvious advantages of text over scans: 1. Speed 2. Storage 3. Searching 4. Quotations 5. Corrections The Details 1. Speed Reading from online scans can be a real pain as changing pages involves downloading another large file of scanning. Flipping through the pages becomes virtually impossible. If you have the time you can download the whole thing and then start reading, but flipping through the pages might still be a pain if they are not relationally linked, and many places still seem to forget that THEIR links do not work on YOUR SYSTEM unless the links are proper for that. 2. Storage You can store about a million eBooks of about a million character each on a terabyte drive at minimal cost and with very little hassle setting up the drive, even just a pocket terabyte drive will do, though it is slower. However, storing a million scans of books is virtually impossible for the everyday person, not to mention the problems reading them listed above. More terabytes and more cables than the average person is really willing to put up with, even for a library. 3. Searching In my own personal and professional opinion the greatest advantage to having text versus scans is searchability. I won't go into every kind of file pretending to be text but the plain text files are the most searchable and the storage space required is the least, particularly in the .zip or similar compressed formats. All the other formats seem to create errors that we have all seen where the search program can't find a word that is right there in front of us on the screen. Pretty much ANY editor or reader program does .txt files without much hassle, both for reading and searching. 4. Quotations I can cut and paste any text quotation into this article without any hassle at all from text files, but you can't do that from a scan. Same for cutting and pasting into your emails, Twitter & other IM formats, and even into .pdf files. For those who never quote anything, not a problem. However, when someone recommends I read something I will likely ask for a few choice quotations to evaluate. 5. Corrections It's difficult in the extreme to correct a scan error... you literally have to do it somethingm like Photoshop as if you were changing pixels, which you really are. It's still not easy to make those same corrections in an Adobe "Portable Document File" as they are NOT PORTABLE! Just try it a few times and you will understand. The more elevated the format, the harder is correction. /// Also, about copyright and public domain. . . . No, you can't have it both ways. . . . You can do a number of things like the PG and GNU, even the EFF, stuff like various forms of "Copyleft," but it is either copyrighted and with permission or it has the legal status of public domain to give everyone a legal, if not totally understood right to redistribute. Some of these give you ONLY the right to your own copy, without the right to hand out other copies. This means you have to read the fine print. With PG's license there is no difficulty: ALL PG eBOOKS CAN BE REDISTRIBUTED WITHOUG PG HASSLE-- there may be other laws in other countries that apply, but not from the PG license. mh From lee at novomail.net Wed Mar 3 08:15:51 2010 From: lee at novomail.net (Lee Passey) Date: Wed, 03 Mar 2010 09:15:51 -0700 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <77bdd.26df5c3.38bee762@aol.com> Message-ID: <4B8E8B37.3050707@novomail.net> On 3/3/2010 1:46 AM, Keith J. Schultz wrote: > Hold on a sec! [snip] > There is a difference between a text being copyright free and in the > public domain.. > One can put a copyright and have it be still in the public domain. On 3/3/2010 2:38 AM, Bowerbird at aol.com wrote: > do not listen to non-lawyers discussing legal matters. Good advice. Mr. Schultz, you are wrong. If something is in the public domain, by definition it cannot have a copyright, and vice-versa. There is, in fact, no such legally recognized entity as "the public domain." The phrase is simply shorthand for "those works for which copyright has expired or is otherwise unenforceable." I have heard it argued (by lawyers) that under the Berne convention one cannot create a copyrightable work and then dedicate it to the public domain. Under Berne, a copyright attaches automatically, instantaneously and unavoidably at the moment of creation. Because there is no real entity called "the public domain," the automatic copyright cannot be transferred to it. At best you have a promise on the part of the creator, unsupported by any consideration, not to sue. If no one has placed detrimental reliance on the promise, the creator can revoke it at any time, putting us back to square one. Just one of the noxious (and perhaps unintended) consequences of the Berne convention. From schultzk at uni-trier.de Wed Mar 3 08:21:03 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Wed, 3 Mar 2010 17:21:03 +0100 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <77bdd.26df5c3.38bee762@aol.com> Message-ID: Hi Michael, You did not quite catch the irony in my message. regards Keith. Am 03.03.2010 um 16:41 schrieb Michael S. Hart: > > Keith Schultz asked why text and not scans? > > Here are the most obvious advantages of text over scans: > > 1. Speed > > 2. Storage > > 3. Searching > > 4. Quotations > > 5. Corrections > > > > The Details > > > 1. Speed > > Reading from online scans can be a real pain as changing > pages involves downloading another large file of scanning. > > Flipping through the pages becomes virtually impossible. > > If you have the time you can download the whole thing and > then start reading, but flipping through the pages might > still be a pain if they are not relationally linked, and > many places still seem to forget that THEIR links do not > work on YOUR SYSTEM unless the links are proper for that. > > > 2. Storage > > You can store about a million eBooks of about a million > character each on a terabyte drive at minimal cost and > with very little hassle setting up the drive, even just > a pocket terabyte drive will do, though it is slower. > > However, storing a million scans of books is virtually > impossible for the everyday person, not to mention the > problems reading them listed above. > > More terabytes and more cables than the average person > is really willing to put up with, even for a library. > > > > 3. Searching > > In my own personal and professional opinion the greatest > advantage to having text versus scans is searchability. > > I won't go into every kind of file pretending to be text > but the plain text files are the most searchable and the > storage space required is the least, particularly in the > .zip or similar compressed formats. > > All the other formats seem to create errors that we have > all seen where the search program can't find a word that > is right there in front of us on the screen. > > Pretty much ANY editor or reader program does .txt files > without much hassle, both for reading and searching. > > > > 4. Quotations > > I can cut and paste any text quotation into this article > without any hassle at all from text files, but you can't > do that from a scan. > > Same for cutting and pasting into your emails, Twitter & > other IM formats, and even into .pdf files. > > For those who never quote anything, not a problem. > > However, when someone recommends I read something I will > likely ask for a few choice quotations to evaluate. > > > > 5. Corrections > > > It's difficult in the extreme to correct a scan error... > you literally have to do it somethingm like Photoshop as > if you were changing pixels, which you really are. > > It's still not easy to make those same corrections in an > Adobe "Portable Document File" as they are NOT PORTABLE! > Just try it a few times and you will understand. > > The more elevated the format, the harder is correction. > > > /// > > > Also, about copyright and public domain. . . . > > No, you can't have it both ways. . . . > > You can do a number of things like the PG and GNU, even > the EFF, stuff like various forms of "Copyleft," but it > is either copyrighted and with permission or it has the > legal status of public domain to give everyone a legal, > if not totally understood right to redistribute. > > Some of these give you ONLY the right to your own copy, > without the right to hand out other copies. > > This means you have to read the fine print. > > With PG's license there is no difficulty: > > ALL PG eBOOKS CAN BE REDISTRIBUTED WITHOUG PG HASSLE-- > there may be other laws in other countries that apply, > but not from the PG license. > > > mh > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d From schultzk at uni-trier.de Wed Mar 3 08:39:50 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Wed, 3 Mar 2010 17:39:50 +0100 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <4B8E8B37.3050707@novomail.net> References: <77bdd.26df5c3.38bee762@aol.com> <4B8E8B37.3050707@novomail.net> Message-ID: Hi Lee, For one the term is "in the public domain". Furthermore, putting something in the public domain is if you care to be technical a license of use. How far that license goes depends on the statements of the author. The coining of the terminology was not originally used in copyright law, but in the protection of intellectual property. It was adopted to by the internet users and publishers to texts. Secondly you ought to get your own facts straight. How can a lawyer argue that said property not be dedicated to the public domain if not said entity is not defined!! S/He could not. regards Keith. Am 03.03.2010 um 17:15 schrieb Lee Passey: > On 3/3/2010 1:46 AM, Keith J. Schultz wrote: > >> Hold on a sec! > > [snip] > >> There is a difference between a text being copyright free and in the >> public domain.. >> One can put a copyright and have it be still in the public domain. > > On 3/3/2010 2:38 AM, Bowerbird at aol.com wrote: > > > do not listen to non-lawyers discussing legal matters. > > Good advice. > > Mr. Schultz, you are wrong. If something is in the public domain, by definition it cannot have a copyright, and vice-versa. > > There is, in fact, no such legally recognized entity as "the public domain." The phrase is simply shorthand for "those works for which copyright has expired or is otherwise unenforceable." > > I have heard it argued (by lawyers) that under the Berne convention one cannot create a copyrightable work and then dedicate it to the public domain. Under Berne, a copyright attaches automatically, instantaneously and unavoidably at the moment of creation. Because there is no real entity called "the public domain," the automatic copyright cannot be transferred to it. At best you have a promise on the part of the creator, unsupported by any consideration, not to sue. If no one has placed detrimental reliance on the promise, the creator can revoke it at any time, putting us back to square one. > > Just one of the noxious (and perhaps unintended) consequences of the Berne convention. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d From cmiske at ashzfall.com Wed Mar 3 10:03:07 2010 From: cmiske at ashzfall.com (cmiske at ashzfall.com) Date: Wed, 03 Mar 2010 11:03:07 -0700 Subject: [gutvol-d] Re: Processing eTexts Message-ID: <20100303110307.0dedd0f3f91314fbc67db20f64e304ca.09cdf66229.wbe@email05.secureserver.net> An HTML attachment was scrubbed... URL: From jimad at msn.com Wed Mar 3 10:24:24 2010 From: jimad at msn.com (Jim Adcock) Date: Wed, 3 Mar 2010 10:24:24 -0800 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> Message-ID: >On the other side, lets just shut everthing down as most of the consumer computers have some way of displaying scans. >So we are just wasting everybodies time? Yes and no. The Google "photocopies" of books available at books.google.com aka their PDF downloads which are just page images ARE useful, I can even read many of them successfully on my Kindle DX. There is even some charm in reading books in their original layout -- and some charm in seeing the occasional scanner's thumb. Reading pages and pages that have been scribbled on by 200 years of students is not very charming, IMHO. And the Google page images have the blotchy blurry heavy-font characteristics of bad photocopies. Even some of Google's EPUB files, which are just OCRs of these same books with all the scannos intact, can sometimes be an interesting read. The question is, in my mind, is Google preserving the books, and doing so for the public good or not? I suspect when Google digitizes the book the original is then trashed by the college library -- the whole point being they do not want to have to pay to maintain physical library books in various states of decay. Google then becomes the sole repository for this information -- excepting a smallish number of copies at TIA. Further, is Google dedicated to trying to keep this work public, or on the contrary is Google hoping for changes in the copyright law so that they can fully privatize these digitizations? Compare to what happens when volunteers at DP or PG correct a text and publish it in electronic form. Publically available? Yes. Available from a huge variety of redundant sources? Yes. Suitable to be republished easily on paper by either NFPs or For-Profit publishers? Yes. Reflowable so that it can be read comfortably on a wide variety of devices by people with differently aged eyes including by people with little or no vision? Yes. Yes. Yes. Etc. However, The DP/PG approach is extremely expensive compared to what Google is doing. Consider: Google Books == about 10 million books photo scanned. DP/PG == 30,000 books "fully restored." So Google's approach is about 300X faster than the DP/PG approach. My Conclusion: In the best of all world's there would be some measure of VALUE in choosing which books DP/PG chooses to put effort into fully restoring -- the idea that somehow DP/PG is going to be able to fully restore all the world's books is surely false. When someone at DP chooses to introduce a book that is expensive to do and the end result has relatively little value to society, that means other more important books will not be restored. It is not simply a question of "First Come First Serve" because on DP a worthy book can easily become stuck on the queues behind a less worthy book, such that the more worthy book is not allowed to be worked on by anybody. How does one measure "worthy vs. non-worthy?" Not a trivial matter, I admit. But to my mind one measure is obvious: Books that real people do not in practice want to read we should not bother to restore! I don't care if it's a book on ancient Sanskrit. If 1000 people want to read it, it's worth doing. If only 6 people want to read it, it's not worth doing. As a simple measure at least the total amount of time people spend reading the book has to exceed the amount of time volunteers spend preparing the book, or it's a loss to society. Again, the most popular books on PG are read 100,000 times more often than the least popular books. Now it's hard to find one of these most popular books to tackle today. But it is trivial to find a book to work on that will be 50X more popular than the average book DP finishes. Let Google deal with the unpopular books, and let DP/PG work on books that people actually *want* to read. From jimad at msn.com Wed Mar 3 10:29:56 2010 From: jimad at msn.com (Jim Adcock) Date: Wed, 3 Mar 2010 10:29:56 -0800 Subject: [gutvol-d] Re: do not listen, pray In-Reply-To: References: <30a2.40c90764.38bf8822@aol.com> Message-ID: >There need to be more volunteers working in the latter stages of the production. And under the current DP "high priesthood" system the only way to get more volunteers working in the latter stages of the production is to get new people working on the earlier stages of production, which then perpetuates the problem. You have to be willing to adjust or modify the "high priesthood" system. From hart at pglaf.org Wed Mar 3 10:39:39 2010 From: hart at pglaf.org (Michael S. Hart) Date: Wed, 3 Mar 2010 10:39:39 -0800 (PST) Subject: [gutvol-d] Re: do not listen, pray In-Reply-To: References: <30a2.40c90764.38bf8822@aol.com> Message-ID: On Wed, 3 Mar 2010, Jim Adcock wrote: > > >There need to be more volunteers working in the latter stages of the > production. > > And under the current DP "high priesthood" system the only way to get more > volunteers working in the latter stages of the production is to get new > people working on the earlier stages of production, which then perpetuates > the problem. You have to be willing to adjust or modify the "high > priesthood" system. I wrote an entire essay on this subject overnight, but was uncertain as to whether I should send it or not, for obvious reasons. However, this brings up at least one point I wanted to make: TO BE EFFICIENT YOU HAVE TO ADJUST YOUR HIGHER LEVELS TO LOWER LEVELS: Meaning that what the higher levels do, and how they do it, the time a higher level person is given, has to be in proportion to lower levels, or you will be inefficient, either due to to much or too little, going through the higher levels. . .it's like the gas to air ratio, driving. You get the most mileage AND the most power when the mixture is right. If people are interested, I will post at least part of that essay, I'm afraid it was VERY late at night, and I got carried away at the end. Please advise, Many thanks!!! Michael From cmiske at ashzfall.com Wed Mar 3 11:01:11 2010 From: cmiske at ashzfall.com (cmiske at ashzfall.com) Date: Wed, 03 Mar 2010 12:01:11 -0700 Subject: [gutvol-d] Re: Processing eTexts Message-ID: <20100303120111.0dedd0f3f91314fbc67db20f64e304ca.b320c83f4e.wbe@email05.secureserver.net> An HTML attachment was scrubbed... URL: From cmiske at ashzfall.com Wed Mar 3 11:06:08 2010 From: cmiske at ashzfall.com (cmiske at ashzfall.com) Date: Wed, 03 Mar 2010 12:06:08 -0700 Subject: [gutvol-d] Re: do not listen, pray Message-ID: <20100303120608.0dedd0f3f91314fbc67db20f64e304ca.71441d184d.wbe@email05.secureserver.net> An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Mar 3 11:17:08 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 3 Mar 2010 14:17:08 EST Subject: [gutvol-d] Re: do not listen, pray Message-ID: <23668.53e77d2.38c00fb4@aol.com> i said: > thousands/hundreds/dozens and just to remind everybody, the solution is very clear, and has been for a very long time, ever since we learned -- by my careful analysis of the many d.p. experiments -- that p1 proofers find as many errors in subsequent proofings as p2 proofers and even p3 proofers do. (indeed, of the 3, the p2 proofers were the least good at locating the errors.) so it's obvious that we can move text to perfection by simply running it through p1 repeatedly. one problem with that -- as we've already found -- is that sometimes p1 proofers will change correct text to incorrect text. that problem can be eliminated easily with a policy to review and reconcile diffs. (this policy is easy to implement roundlessly, and will also serve to train up your low-quality proofers, so it's win-win.) the other problem currently with repeated p1 is that d.p. hasn't created an unambiguous set of proofing instructions -- i know, you'd think the need for that would be obvious -- and thus sometimes proofers "cycle through" corrections... (e.g., a first proofer dehyphenates, a second rehyphenates, a third asterisks the hyphen, a fourth dehyphenates, etc.) i haven't discussed the f1/f2 problem, because it's a mirror of the p1/p2/p3 problem. a quick-and-easy confirmation of the f1 by a subsequent f1 view, and we're off to the races. likewise, i have not discussed the postprocessing problem, not explicitly for the most part, because it's the microcosm of the thousands/hundreds/dozens problem, it certainly is. and once again, the problem is the workflow. by the time all of its pages have been proofed and formatted, the book should fall in place more or less naturally and automatically. the fact that it does not, in the d.p. workflow, is a shortfall... it indicates that the workflow is deficient in some major way. but correcting the postprocessing problems is relatively easy. you simply need to analyze each page that needs "finishing" and determine how the proofers could've provided that for it, and you modify the proofing instructions appropriately. easy. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From hart at pglaf.org Wed Mar 3 11:31:11 2010 From: hart at pglaf.org (Michael S. Hart) Date: Wed, 3 Mar 2010 11:31:11 -0800 (PST) Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <77bdd.26df5c3.38bee762@aol.com> <4B8E8B37.3050707@novomail.net> Message-ID: IRRC, the public domain existed, and included nearly everything in print, etc., long before copyright was implented 300 years ago in Western law. The terminology may have varied over the years, but the concept is there. Copyright [Western] was invented 250 years earlier to stifle Gutenberg's Press as a threat to The Stationers' Guild's historic monopoly. They wanted it back. And, finally, with the weak queen, Anne, they got it. And we have been stuck with it ever since!!! On Wed, 3 Mar 2010, Keith J. Schultz wrote: > Hi Lee, > > For one the term is "in the public domain". > Furthermore, putting something in the public domain > is if you care to be technical a license of use. > How far that license goes depends on the statements of > the author. > > The coining of the terminology was not originally used > in copyright law, but in the protection of intellectual property. > It was adopted to by the internet users and publishers to texts. > > Secondly you ought to get your own facts straight. How can a > lawyer argue that said property not be dedicated to the public domain > if not said entity is not defined!! > S/He could not. > > regards > Keith. > > Am 03.03.2010 um 17:15 schrieb Lee Passey: > > > On 3/3/2010 1:46 AM, Keith J. Schultz wrote: > > > >> Hold on a sec! > > > > [snip] > > > >> There is a difference between a text being copyright free and in the > >> public domain.. > >> One can put a copyright and have it be still in the public domain. > > > > On 3/3/2010 2:38 AM, Bowerbird at aol.com wrote: > > > > > do not listen to non-lawyers discussing legal matters. > > > > Good advice. > > > > Mr. Schultz, you are wrong. If something is in the public domain, by definition it cannot have a copyright, and vice-versa. > > > > There is, in fact, no such legally recognized entity as "the public domain." The phrase is simply shorthand for "those works for which copyright has expired or is otherwise unenforceable." > > > > I have heard it argued (by lawyers) that under the Berne convention one cannot create a copyrightable work and then dedicate it to the public domain. Under Berne, a copyright attaches automatically, instantaneously and unavoidably at the moment of creation. Because there is no real entity called "the public domain," the automatic copyright cannot be transferred to it. At best you have a promise on the part of the creator, unsupported by any consideration, not to sue. If no one has placed detrimental reliance on the promise, the creator can revoke it at any time, putting us back to square one. > > > > Just one of the noxious (and perhaps unintended) consequences of the Berne convention. > > _______________________________________________ > > gutvol-d mailing list > > gutvol-d at lists.pglaf.org > > http://lists.pglaf.org/mailman/listinfo/gutvol-d > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From Bowerbird at aol.com Wed Mar 3 11:39:16 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 3 Mar 2010 14:39:16 EST Subject: [gutvol-d] Re: Processing eTexts Message-ID: <2516d.7515f64.38c014e4@aol.com> carel said: > Yes, we have been doing a semantic dance ok, that's kinda what i thought. > I just feel that a final 'proofing' stage before release > would assist in locating?errors that were either > missed or introduced by the processing. well, having "one more proofing" is _always_ a great thing. provided that somebody else is willing to _do_ it, that is... the acid test is whether you deem it to be so necessary that you will do it yourself. that's a nice way to help you decide whether the _cost_ of that additional proofing is _worth_it_, whether it will provide enough _benefit_ in the text accuracy. once you gain enough trust in your tool and its performance, believe me that you'll decide that it performs "well enough"... but yes, it is important to gauge the accuracy of your tool... my goal is less-than-1-error-every-10-pages, and my tool and workflow consistently delivers better results than that. > A human will do the processing and humans > can make mistakes and some of the mistakes > that could be made in what would be both? > error and formatting processes could be quite grand. i don't worry about errors that are "quite grand"... they're easy to spot, and obvious to debug and fix. my experience is small errors are more troubling... > I feel that a second set of eyes can never be > a bad thing when it comes to something like this. that's easy to say until we ask you to be "the second set" on a million e-texts, all of which are almost perfect now. you -- and anyone else we ask -- will say "good enough". at some point, the benefit of greater accuracy isn't worth it. and then we say, "if the people reading this book because they _want_ to read it cannot find any errors in it, then that's their problem, but we cannot spend any more time having innocent people re-proof this book _once_again_ simply because there _might_ still be an error in the thing." again, i draw the line quite specifically. if a page has been looked at by 2 people in a row who could not find an error, then i certify that page as "good enough for the public" and stop looking at it. you can make it 1 person, or 3 people, or 4 people or 8 people or 22 people, whatever you like, but nobody would ever suggest we keep proofing a book forever. now, let me be clear that i understand that you only said "a second set of eyes" and not 22 sets of them. i agree... and that's specifically why i use the comparison method, because it gives us two sets of eyes on a book, essentially. > And, those with less experience (or no experience) > in shaping the output of a text may feel more confident > about doing the process if they know someone else will > provide a checksum for their work before it goes public. except that the _public_ provides that checksum for them. they constitute your "second set of eyes", your 3rd, your 23rd. > the text would be released to PG and then should be > placed in some environment that allows for editing it > to 'perfection.' it would be nice if p.g. did this. or d.p. but neither does. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From pjb at informatimago.com Wed Mar 3 11:42:28 2010 From: pjb at informatimago.com (Pascal J. Bourguignon) Date: Wed, 3 Mar 2010 20:42:28 +0100 Subject: [gutvol-d] DP/PG vs. Google In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> Message-ID: On 2010-03-03, at 19:24, Jim Adcock wrote: > However, The DP/PG approach is extremely expensive compared to what > Google > is doing. Consider: Google Books == about 10 million books photo > scanned. > DP/PG == 30,000 books "fully restored." So Google's approach is > about 300X > faster than the DP/PG approach. My Conclusion: In the best of all > world's > there would be some measure of VALUE in choosing which books DP/PG > chooses > to put effort into fully restoring -- the idea that somehow DP/PG is > going > to be able to fully restore all the world's books is surely false. I think that the bet made by Google, is that sooner or later, sufficiently smart AI and OCR technology will be developed to allow to process its scans and do the job of PG automatically. The only question is when it will happen, and some think that singularity will occur within 20 years. But this is probably not a reason to stop working on PG! :-) -- __Pascal Bourguignon__ http://www.informatimago.com/ From jimad at msn.com Wed Mar 3 13:05:50 2010 From: jimad at msn.com (James Adcock) Date: Wed, 3 Mar 2010 13:05:50 -0800 Subject: [gutvol-d] Re: DP/PG vs. Google In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> Message-ID: >I think that the bet made by Google, is that sooner or later, Sufficiently smart AI and OCR technology will be developed to allow to process its scans and do the job of PG automatically. I would think that anyone who has worked on OCR, or automated grammars, or AI, or in making books for PG can tell you they would lose that bet! (Not that a lot can't be done to get rid of 90% of the errors "automagically!") From lee at novomail.net Wed Mar 3 13:49:50 2010 From: lee at novomail.net (Lee Passey) Date: Wed, 03 Mar 2010 14:49:50 -0700 Subject: [gutvol-d] The conundrum of FOSS projects (was Re: do not listen, pray) In-Reply-To: References: <30a2.40c90764.38bf8822@aol.com> Message-ID: <4B8ED97E.5020105@novomail.net> On 3/3/2010 11:29 AM, Jim Adcock wrote: > You have to be willing to adjust or modify the "high > priesthood" system. I have participated, or attempted to participate, in a number of FOSS projects over my career as a programmer, and I have a few observations which you may find relevant. Every successful FOSS project I have ever observed has started with the vision of a single individual. In the years leading up to 1995, Eric A. Young single-handedly managed to implement the full suite of cryptosystems used in SSL, and in that year made it available on the internet for free. This effort became the foundation of OpenSSL. Until he was lured away by RSA, Mr. Young was the driving force behind OpenSSL. Today, the role of visionary is played by Ralf Engelschall and Ben Laurie. In 1991 Andrew Tridgell, another Australian needed to mount disk space from a Unix server to his DOS PC. Using a packet sniffer he was able to reverse engineer the System Message Block protocol used by IBM's NetBIOS system, and which was the basis for DOS and Windows networking. This work eventually became Samga, a Unix/Linux software suite that provides file and print services to Windows-based clients. Mr. Tridgell still participates, and is the driving force behind the Samba open source project. While many have criticized his alleged heavy-handedness, I believe that the success of the Linux kernel is primarily due to the fact that Linux Torvalds still has absolute authority over what changes go into that kernel. Michael Hart plays the same role at Project Gutenberg that these programming giants played in the development of their respective software projects. Project Gutenberg was the brainchild of Mr. Hart, and he continues to be the driving force and visionary behind the project. While he, with uncharacteristic modesty, primarily credits the volunteers for the nature of Project Gutenberg, I disagree. For better or for worse, Project Gutenberg is the product of Mr. Harts vision and tenacity. Distributed Proofreaders was founded in 2000 by Charles Franks to assist in the production of electronic texts specifically to be distributed by Project Gutenberg. According to my recollection, Mr. Franks' theory was that production of e-texts was hampered by the fact that few people were willing to take on the task of producing an entire e-text, particularly through the arduous text proofreading process. His vision was to take a text and break it up into discrete units (in this case, pages) so that many people could be involved in the proofreading process and lightening the burden. Thus, the one time DP catch-phrase, "Proofread a page a day, that's all we ask." The volunteers at Distributed Proofreaders have become very good at proofreading texts. I have also seen any number of FOSS projects which have attempted to begin through consensus and team building. I can't name any of these projects for you, because they have all either failed or were still-born. I think I have learned this lesson from my observations of these projects: to be successful you must have one single visionary who controls, more or less, the project. Having that visionary will not guarantee success, but not having it will surely doom it. At if a project loses its visionary, or marginalizes him or her to the point where he or she no longer controls the vision, the project will become increasingly ineffective and inefficient, and will descend into in-fighting and turf wars as others try to control the vision. Vision cannot be obtained by consensus. When someone criticizes Project Gutenberg for supposed failings, or the inability or unwillingness to keep up with the times, and Michael Hart responds with his now inevitable suggestion to "JUST GO FOR IT," what he is saying is "what you are suggesting does not match my vision. If you feel your vision is better than mine I encourage you to go elsewhere to pursue it. We can offer some infrastructure support (disk space) and you are welcome to invite Project Gutenberg volunteers to go help you actualize your vision, but I will not substitute your vision for mine." I am not prepared to say that Distributed Proofreaders has lost its vision. It is still proofreading a lot of pages every day. It is clearly /not/ an efficient process, but efficiency was not one of the project goals. We are all familiar with the old saw that while one woman can have a baby in 9 months that doesn't mean that 9 women can have a baby in one month. I don't believe that DP is saying "if one person can proofread a text in 10 days, then 10 people can proofread it in one day," but they are saying "100 people can proofread it in two days." Distributed Proofreaders goal was to increase the speed that texts would be proofread, to lighten the load from any one individual and to make the process more fault-tolerant (if one volunteer quit, the project would not need to be restarted). What has happened is that the needs of the consumer has changed. I'm fairly certain that the proofread texts now sitting in DP's Post-Processing queue would meet Michael Hart's standards (or lack thereof, as he is continually telling me he has no standards) and could be released to Project Gutenberg as is. Other consumers, however, have higher standards, and Distributed Proofreaders is now trying to satisfy those standards as well, and those new standards require post-processing of a work as a single unit by a single person. DP's vision and expertise is in the area of distributed proofreading, not in the area of efficient e-book creation. This is why texts languish in the Post-Processing queue. Your problem, Mr. Adcock, is that you believe you can change the vision underlying either of these organization through rational argument. Vision is an intuitive, almost religious, experience, and blind faith is immune to rationality. It is virtually impossible that you will be able to change the vision either of Mr. Hart or whomever is currently the visionary at Distributed Proofreaders. I suspect that this is why Roger Frank has created his own web site for "roundless proofing;" his vision differs from that of Distributed Proofreaders, and it was simply easier to go his own way than to try and change someone else's vision. I believe I agree with every criticism you have leveled at both Project Gutenberg and Distributed Proofreaders, which is to say, I believe I accept your vision. So let me mimic the words of Michael Hart: GO FOR IT! Put together your own project to complete high-quality public domain e-books. You could certainly harvest all of the files currently in the DP post-processing queue to start with. You might be able to grab the HTML files from PG if you can find scans to go with them. Take advantage of the hardware resources that Mr. Newby has offered. Post messages here and at DP inviting volunteers to help you out. No need to return the e-books to either of those organizations; if they want them they will know where to find them. I will help out as much as possible. But please stop trying to convince Distributed Proofreaders or Project Gutenberg to accept a new vision. They are old and are set in their ways. They represent the last internet generation, not the current one. Show us the way forward, and let sleeping dogs lie. From Bowerbird at aol.com Wed Mar 3 14:17:41 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 3 Mar 2010 17:17:41 EST Subject: [gutvol-d] Re: DP/PG vs. Google Message-ID: <3103a.2b657b5c.38c03a05@aol.com> jim said: > they would lose that bet! then jim said: > (Not that a lot can't be done to > get rid of 90% of the errors "automagically!") so you won't grant 100%, but you will grant 90%. well, google is probably betting they can get rid of 97% of the errors automatically. do you want to bet against google? because i'll take that bet against you. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Wed Mar 3 14:57:58 2010 From: jimad at msn.com (James Adcock) Date: Wed, 3 Mar 2010 14:57:58 -0800 Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not listen, pray) In-Reply-To: <4B8ED97E.5020105@novomail.net> References: <30a2.40c90764.38bf8822@aol.com> <4B8ED97E.5020105@novomail.net> Message-ID: >I have participated, or attempted to participate, in a number of FOSS projects over my career as a programmer, and I have a few observations which you may find relevant. Sorry, but by a "high priesthood" system I mean the typical pattern of a tech organization, the same way that DP is organized, where a newbie starts at the "grunt" level, and by playing the game and following the rules advances to the roll of "Lord High Pooh-Bah." My only objection to this organization at DP is that they are not getting the right number people in each of the various roles, and don't seem to understand (or be willing to accept) what changes they would have to make in order to get the right number of people in any particular role. >Every successful FOSS project I have ever observed has started with the vision of a single individual And every (continuing to be) successful organization eventually must grow past that individual. >I don't believe that DP is saying "if one person can proofread a text in 10 days, then 10 people can proofread it in one day," but they are saying "100 people can proofread it in two days." On the contrary, the problem is that an individual, such as myself, can create a decent book in about 40 hours work over the course of one month which consists of about 720 hours elapsed time. The average book passing through DP nowadays takes over 30,000 hours elapsed time, with an average of 20 volunteers working on each book. I think we know from previous analysis that doing a book through DP takes at least 1.5X as much hands-on time as doing it "solo." Whether that is a problem or not depends on what you think about volunteers and their time. I look at it and say gee, we could be getting an additional 10,000 books out of DP if we got the system tweaked right. That seems like a change worth doing to me. Now the fact that doing a book through DP takes 40X more elapsed time than doing it "solo" is that a problem or not? Obviously some people think that taking that long corresponds to "quality" -- a project needs to age on the queues like an old cheese. Other people like me find waiting for our projects to "go live" again for a few days or weeks once or twice a year a bore and a nuisance. Some DP insiders agree that getting "scooped" by others posting that which DP is still sitting on can be disheartening -- but there seems to be a misunderstanding for who there is to blame when this happens. >Your problem, Mr. Adcock, is that you believe you can change the vision underlying either of these organization through rational argument. I wouldn't think that having a wrong number of people in any particular role at a particular point in time would be a big-enough deal as to qualify as a "vision statement". But if does then I agree this would be a problem. I would certainly agree based on personal experience that NFP organizations that run into difficulties are frequently not very receptive to rational analysis! "My problem", if we have to talk about my problems of which there are many, is that I submitted two books in good faith to DP which are now stuck there indefinitely after I contributed many many hours of my own time and tears, and I have no way to get those books back out. >GO FOR IT! I am. I create books for PG "solo." Are they are high quality as DP? No, probably not quite there. Are they created much more efficiently? Yes, much more efficiently. I have created at least one tool that makes this much more efficient for me. Others are welcome to try it if they wish. From jimad at msn.com Wed Mar 3 15:03:34 2010 From: jimad at msn.com (James Adcock) Date: Wed, 3 Mar 2010 15:03:34 -0800 Subject: [gutvol-d] Re: DP/PG vs. Google In-Reply-To: <3103a.2b657b5c.38c03a05@aol.com> References: <3103a.2b657b5c.38c03a05@aol.com> Message-ID: >do you want to bet against google? >because i'll take that bet against you. Sure, I'd be happy to take that bet, if I am allowed to win it or lose it in a finite amount of time - such as a decade. What I think is much more likely in a decade is that Google is either gives up or they figure out how to post much more attractive page images. I actually don't think they have much of any interest in posting higher quality automatic OCR transcriptions. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sly at victoria.tc.ca Wed Mar 3 15:14:04 2010 From: sly at victoria.tc.ca (Andrew Sly) Date: Wed, 3 Mar 2010 15:14:04 -0800 (PST) Subject: [gutvol-d] Re: The conundrum of FOSS projects In-Reply-To: <4B8ED97E.5020105@novomail.net> References: <30a2.40c90764.38bf8822@aol.com> <4B8ED97E.5020105@novomail.net> Message-ID: Thanks for the thought-provoking post Lee. That helped put things in a new context for me. --Andrew On Wed, 3 Mar 2010, Lee Passey wrote: > I have participated, or attempted to participate, in a number of FOSS > projects over my career as a programmer, and I have a few observations > which you may find relevant. > [snip] From Bowerbird at aol.com Wed Mar 3 15:16:57 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 3 Mar 2010 18:16:57 EST Subject: [gutvol-d] roundlessness -- 010 Message-ID: <34f8a.2b48e537.38c047e9@aol.com> by the way, i just thought i would reiterate an old screenshot... perhaps some of the people planting bugs in rfrank's ears will consider pointing him in the directions suggested here: > http://z-m-l.com/3column-zml.jpg that's a big screenshot, because i've got a big screen, but the idea is that the proofer does the word-by-word scan _not_ against a web-page's textfield version of the page, but rather against an .html-realized version of the page... (if proofers wanted, you could even use the d.p. font on it.) the main benefit is that you free 'em from having to look at the markup, because that's an unnecessary distraction. they see actual rendered italics, not the markup for italics. it's also possible this way to red-flag any possible scannos, as well as capitalization and punctuation improbabilities... you can also colorize quotations, which helps locate any missing or incorrect quotemarks. like i said, i have a big screen, so i can put up 3 pages -- the textfield, the original scan, and the .html version, but for a smaller screen, you'd put up the scan and the .html. then, only if there are changes to be made will the proofer summon the textfield for editing. i also show lots of buttons on the screen. some are there to add words to the book's custom dictionary, so that you don't have to have them flagged the next time they appear. the others are just marked with numbers, to indicate things that they could be used for, such as italicizing selected text. again, proofing the .html version is easier than the textfield. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From hart at pglaf.org Wed Mar 3 15:44:18 2010 From: hart at pglaf.org (Michael S. Hart) Date: Wed, 3 Mar 2010 15:44:18 -0800 (PST) Subject: [gutvol-d] Re: DP/PG vs. Google In-Reply-To: References: <3103a.2b657b5c.38c03a05@aol.com> Message-ID: Google's plan, from the outset, a year before we ever heard about it via the media, was to create the most "eBooks" for the cheapest cost and to generate the most media blitz public relations they could; it really had very little to do with creating high quality eBooks, tho, even I must admit, some came out better than I expected. When it comes to comparisons to PG/DP, Google is a paper tiger quite literally when it comes to quality, but when it comes to quantity it is PG/DP that is the dead tree big stripey cat. All in all, it won't hurt either way, and the ends will hit middles, with greater numbers of eBooks and greater quality. Don't forget The Internet Archive, etc. On Wed, 3 Mar 2010, James Adcock wrote: > > >do you want to bet against google? > > >because i'll take that bet against you. > > ? > > Sure, I?d be happy to take that bet, if I am allowed to win it or lose it in a finite > amount of time ? such as a decade.? What I think is much more likely in a decade is that > Google is either gives up or they figure out how to post much more attractive page images.? > I actually don?t think they have much of any interest in posting higher quality automatic > OCR transcriptions. > > > > > From hart at pglaf.org Wed Mar 3 16:06:59 2010 From: hart at pglaf.org (Michael S. Hart) Date: Wed, 3 Mar 2010 16:06:59 -0800 (PST) Subject: [gutvol-d] Re: Processing eTexts In-Reply-To: <2516d.7515f64.38c014e4@aol.com> References: <2516d.7515f64.38c014e4@aol.com> Message-ID: For those who worry about BB and "perfect" eBooks, the following should ease that worry greatly!!! On Wed, 3 Mar 2010, Bowerbird at aol.com wrote: > carel said: > >? Yes, we have been doing a semantic dance > > ok, that's kinda what i thought. > > > >?? I just feel that a final 'proofing' stage before release > >?? would assist in locating?errors that were either > >?? missed or introduced by the processing. > > well, having "one more proofing" is _always_ a great thing. > > provided that somebody else is willing to _do_ it, that is... > > the acid test is whether you deem it to be so necessary that > you will do it yourself.? that's a nice way to help you decide > whether the _cost_ of that additional proofing is _worth_it_, > whether it will provide enough _benefit_ in the text accuracy. > > once you gain enough trust in your tool and its performance, > believe me that you'll decide that it performs "well enough"... > > but yes, it is important to gauge the accuracy of your tool... > > my goal is less-than-1-error-every-10-pages, and my tool > and workflow consistently delivers better results than that. > > > >?? A human will do the processing and humans > >?? can make mistakes and some of the mistakes > >?? that could be made in what would be both? > >?? error and formatting processes could be quite grand. > > i don't worry about errors that are "quite grand"... > they're easy to spot, and obvious to debug and fix. > > my experience is small errors are more troubling... > > > >?? I feel that a second set of eyes can never be > >?? a bad thing when it comes to something like this. > > that's easy to say until we ask you to be "the second set" > on a million e-texts, all of which are almost perfect now. > > you -- and anyone else we ask -- will say "good enough". > > at some point, the benefit of greater accuracy isn't worth it. > > and then we say, "if the people reading this book because > they _want_ to read it cannot find any errors in it, then > that's their problem, but we cannot spend any more time > having innocent people re-proof this book _once_again_ > simply because there _might_ still be an error in the thing." > > again, i draw the line quite specifically.? if a page has been > looked at by 2 people in a row who could not find an error, > then i certify that page as "good enough for the public" and > stop looking at it.? you can make it 1 person, or 3 people, > or 4 people or 8 people or 22 people, whatever you like, but > nobody would ever suggest we keep proofing a book forever. > > now, let me be clear that i understand that you only said > "a second set of eyes" and not 22 sets of them.? i agree... > and that's specifically why i use the comparison method, > because it gives us two sets of eyes on a book, essentially. > > > >?? And, those with less experience (or no experience) > >?? in shaping the output of a text may feel more confident > >?? about doing the process if they know someone else will > >?? provide a checksum for their work before it goes public. > > except that the _public_ provides that checksum for them. > > they constitute your "second set of eyes", your 3rd, your 23rd. > > > >?? the text would be released to PG and then should be > >?? placed in some environment that allows for editing it > >?? to 'perfection.' > > it would be nice if p.g. did this.? or d.p.? but neither does. > > -bowerbird > > From Bowerbird at aol.com Wed Mar 3 16:48:36 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 3 Mar 2010 19:48:36 EST Subject: [gutvol-d] Re: Processing eTexts Message-ID: <3985b.3dbb423c.38c05d64@aol.com> michael said: > For those who worry about BB and "perfect" eBooks, > the following should ease that worry greatly!!! i must admit i can no longer tell when you are serious, when you are misreading me, and when you are joking. but i've been _perfectly_ clear, and consistent, all along. i say a book that has 1-error-or-less-every-10-pages is _perfectly_ ready for release to the general public, with the explicit understanding that we do all we can to encourage and make it easy for that general public to help us in moving the books toward _perfection_... i hear lots of chestbeating about quality -- both by those who argue for it, and those who argue otherwise -- but i see very little activity productively engaged in attaining it. i've also done scads of research on how to develop tools and processes that will help us improve on our accuracy, and means by which we can have the general public help. precious little of my progress has been utilized by anyone. i'm not hung up on perfection -- it's nigh unattainable -- but neither have i ever been willing to have any other goal. i consider my position to be a fully reasonable one, and i've been perfectly clear on it, and preached it consistently, since the start. if you heard anything else, you misheard. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From hart at pglaf.org Wed Mar 3 17:14:39 2010 From: hart at pglaf.org (Michael S. Hart) Date: Wed, 3 Mar 2010 17:14:39 -0800 (PST) Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not listen, pray) In-Reply-To: <4B8ED97E.5020105@novomail.net> References: <30a2.40c90764.38bf8822@aol.com> <4B8ED97E.5020105@novomail.net> Message-ID: While Lee's comments are pretty great, there are a few comments/corrections: On Wed, 3 Mar 2010, Lee Passey wrote: > On 3/3/2010 11:29 AM, Jim Adcock wrote: > > > You have to be willing to adjust or modify the "high > > priesthood" system. > > I have participated, or attempted to participate, in a number of FOSS > projects over my career as a programmer, and I have a few observations which > you may find relevant. > > Every successful FOSS project I have ever observed has started with the > vision of a single individual. In the years leading up to 1995, Eric A. > Young single-handedly managed to implement the full suite of cryptosystems > used in SSL, and in that year made it available on the internet for free. > This effort became the foundation of OpenSSL. Until he was lured away by > RSA, Mr. Young was the driving force behind OpenSSL. Today, the role of > visionary is played by Ralf Engelschall and Ben Laurie. > > In 1991 Andrew Tridgell, another Australian needed to mount disk space from > a Unix server to his DOS PC. Using a packet sniffer he was able to reverse > engineer the System Message Block protocol used by IBM's NetBIOS system, and > which was the basis for DOS and Windows networking. This work eventually > became Samga, a Unix/Linux software suite that provides file and print > services to Windows-based clients. Mr. Tridgell still participates, and is > the driving force behind the Samba open source project. > > While many have criticized his alleged heavy-handedness, I believe that the > success of the Linux kernel is primarily due to the fact that Linux Torvalds > still has absolute authority over what changes go into that kernel. > > Michael Hart plays the same role at Project Gutenberg that these programming > giants played in the development of their respective software projects. > Project Gutenberg was the brainchild of Mr. Hart, and he continues to be the > driving force and visionary behind the project. While he, with > uncharacteristic modesty, primarily credits the volunteers for the nature of > Project Gutenberg, I disagree. For better or for worse, Project Gutenberg is > the product of Mr. Harts vision and tenacity. > > Distributed Proofreaders was founded in 2000 by Charles Franks to assist in > the production of electronic texts specifically to be distributed by Project > Gutenberg. According to my recollection, Mr. Franks' theory was that > production of e-texts was hampered by the fact that few people were willing > to take on the task of producing an entire e-text, particularly through the > arduous text proofreading process. His vision was to take a text and break > it up into discrete units (in this case, pages) so that many people could be > involved in the proofreading process and lightening the burden. Thus, the > one time DP catch-phrase, "Proofread a page a day, that's all we ask." The > volunteers at Distributed Proofreaders have become very good at proofreading > texts. > > I have also seen any number of FOSS projects which have attempted to begin > through consensus and team building. I can't name any of these projects for > you, because they have all either failed or were still-born. Sadly to say, this is all too true, both locally and nationally, not to mention internationally. > I think I have learned this lesson from my observations of these projects: > to be successful you must have one single visionary who controls, more or > less, the project. Having that visionary will not guarantee success, but not > having it will surely doom it. At if a project loses its visionary, or > marginalizes him or her to the point where he or she no longer controls the > vision, the project will become increasingly ineffective and inefficient, > and will descend into in-fighting and turf wars as others try to control the > vision. I would like think that Project Gutenberg, and Distributed Proofreaders will continue on without me until they can't find anything more to do on eBooks, and perhaps then even continue on to something else. > Vision cannot be obtained by consensus. I suppose I have been lucky enough to have managed this once or twice. > When someone criticizes Project Gutenberg for supposed failings, or the > inability or unwillingness to keep up with the times, and Michael Hart > responds with his now inevitable suggestion to "JUST GO FOR IT," what he is > saying is "what you are suggesting does not match my vision. If you feel > your vision is better than mine I encourage you to go elsewhere to pursue "I encourage you to go elsewhere to pursue it" is not quite correct, even though there is some amerlioration below. We are more than happy to house any free eBooks efforts right here at Project Gutenberg, with or without our gutenberg.org or pglaf.org domain being associated, it's pretty much up the the people in question, and if they don't want some asscociation with PG we will provide readingroo.ms, etc., etc., etc. We will provide ALL of the infrastructure possible, and ask volunteers to help, but, being volunteers, it is really up to them. To lead here at Project Gutenberg you have to lead by example. DO SOMETHING!!! [You'll probably have to do it a couple dozen times.] Then ask others to get on the bandwagon with you and do it some more. When this works it is like starting an avalanche with snowballs. /// I think if Mr. Bowerbird had been willing to follow such a plan and to post an example of a completed book he did once a month, or even once every two or three months, he/we would have dozens of them online by now and there would no longer be arguments of such hypothetical types, but much more concretized. I must state for the record that I have encouraged him to this, pretty much every single year he has been here. I would encourage anyone/everyone else to do the same. It's all you would have to do to wrest "control" of PG from me, and then I could go invent something else. > it. We can offer some infrastructure support (disk space) and you are > welcome to invite Project Gutenberg volunteers to go help you actualize your > vision, but I will not substitute your vision for mine." Not quite right: What I will not do, as asked so many times, is to state for official record that YOU are the official boss of Project Gutenberg and that YOUR method IS THE ONLY OFFICIAL METHOD OF PROJECT GUTENBERG. > I am not prepared to say that Distributed Proofreaders has lost its vision. > It is still proofreading a lot of pages every day. It is clearly /not/ an > efficient process, but efficiency was not one of the project goals. We are > all familiar with the old saw that while one woman can have a baby in 9 > months that doesn't mean that 9 women can have a baby in one month. I don't No, but a group of women can have an average of one baby per month. When you are dealing with larger numbers it's not exactly the same. > believe that DP is saying "if one person can proofread a text in 10 days, > then 10 people can proofread it in one day," but they are saying "100 people > can proofread it in two days." Distributed Proofreaders goal was to increase > the speed that texts would be proofread, to lighten the load from any one > individual and to make the process more fault-tolerant (if one volunteer > quit, the project would not need to be restarted). Actually, 10 people CAN do that kind of job in one day, and have!!! However, it is nice to have both someone at the wheel and a substitute. > What has happened is that the needs of the consumer has changed. I'm fairly > certain that the proofread texts now sitting in DP's Post-Processing queue > would meet Michael Hart's standards (or lack thereof, as he is continually > telling me he has no standards) Again not quite right: It's not that I have no standards, I just don't force them on people. Even when it comes down to hard and fast accuracy percentages, I will state the accuracy level I hope for at any given time. Right now it is 99.975% Earlier it was 99.95% [co-opted by the Library of Congress, hee hee!] Before that it was 99.9%, but that was when I started with a version 0.1 not a version 1.0, and worked up to 1.0. > and could be released to Project Gutenberg as is. Other consumers, however, > have higher standards, and Distributed Proofreaders is now trying to satisfy > those standards as well, and those new standards require post-processing of > a work as a single unit by a single person. We always had a single person as the last post-processor. First it was me, then Judy Boss, then me again, then Greg Newby, then me again, then Newby again, etc., etc., etc. > DP's vision and expertise is in the area of distributed proofreading, not in > the area of efficient e-book creation. This is why texts languish in the > Post-Processing queue. > > Your problem, Mr. Adcock, is that you believe you can change the vision > underlying either of these organization through rational argument. Personally, I believe in rational argument, with stated premises followed by stated conclusions, stacked on top of each other to final conclusions. However, as many of you have undoubtedly note bened, when such arguments are put forth, the opposition ignores them in "fair and balanced" ways. [Just to make sure those who never heard of "fair and balanced" look it up] > Vision is an intuitive, almost religious, experience, and blind faith is > immune to rationality. It is virtually impossible that you will be able to > change the vision either of Mr. Hart or whomever is currently the visionary > at Distributed Proofreaders. While my faith in the whole of the eBook movmement and Open Source is pretty much unshakeable, it is a rational faith, not blind, based on the simple cost/benefit ratio. In then end just plain individuals can do all the eBooks and post them where seach engines can find them. It's nice to have large collections, but not necessary. > I suspect that this is why Roger Frank has created his own web site for > "roundless proofing;" his vision differs from that of Distributed > Proofreaders, and it was simply easier to go his own way than to try and > change someone else's vision. And so too could anyone else, with less effort, and more cooperation. However, doing it yourself has certain inalienable advantages!!! > I believe I agree with every criticism you have leveled at both Project > Gutenberg and Distributed Proofreaders, which is to say, I believe I accept > your vision. So let me mimic the words of Michael Hart: > > GO FOR IT! > > Put together your own project to complete high-quality public domain > e-books. You could certainly harvest all of the files currently in the DP > post-processing queue to start with. You might be able to grab the HTML > files from PG if you can find scans to go with them. Take advantage of the > hardware resources that Mr. Newby has offered. Post messages here and at DP > inviting volunteers to help you out. No need to return the e-books to either > of those organizations; if they want them they will know where to find them. > I will help out as much as possible. > > But please stop trying to convince Distributed Proofreaders or Project > Gutenberg to accept a new vision. They are old and are set in their ways. > They represent the last internet generation, not the current one. Show us > the way forward, and let sleeping dogs lie. I'm still interested in new visions, but just not those that tell me to do something YOU should be doing, even though I am willing to help. I am willing to help!!! Period. That's the bottom line. And you don't even have to give me or PG any credit. . . . From prosfilaes at gmail.com Wed Mar 3 17:30:30 2010 From: prosfilaes at gmail.com (David Starner) Date: Wed, 3 Mar 2010 20:30:30 -0500 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> Message-ID: <6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com> On Wed, Mar 3, 2010 at 1:24 PM, Jim Adcock wrote: >?I suspect when Google digitizes the book the > original is then trashed by the college library That would be silly. When you have endowments the size of the Harvard University, you have no need to do that; since they're already built book-storage buildings where books can stored in more space-efficient forms than browseable stacks and are retrievable only with a couple days' notice, you're simply more free to exile them there. > -- the whole point being > they do not want to have to pay to maintain physical library books in > various states of decay. The whole point of this thing is that Google thought that digitalizing this material would be valuable, and the universities all thought that it would be valuable to have digital copies of their collection, and that it would further their mission to spread knowledge. > ?Google then becomes the sole repository for this > information No. The universities all have copies of all the scans made from their books. > But to my mind one measure is > obvious: ?Books that real people do not in practice want to read we should > not bother to restore! Then we aren't doing enough porn. If your sole measure of worthiness is the number of hits, then forget about doing the works of Sarah Orne Jewett, let's start digging up all that erotica published in the 20s and 30s under the table and watch the Google hits come flying it. > As a simple measure at least the total > amount of time people spend reading the book has to exceed the amount of > time volunteers spend preparing the book, or it's a loss to society. It's not a loss to society to take time that would be used for watching TV and use it to restore books. It's not a loss to society if we make a work accessible to the right scholar, or if we inspire the right person. > But it is trivial to find a book to work on that will be > 50X more popular than the average book DP finishes. First, looking at the puerile crap (no offense intended) that comes up as done by you, I'm not sure you can find it. The first Slashdotting of DP, someone complained that among the little material we had available was my scan of "From October to Brest-Litovsk", but to this day, I think that book--history written with lightning--was one of the more important works I did, and probably more read too (someone did it for Librivox). In some sense, the single most popular work PG has has to be the 1913 Webster's, which has been borrowed as the basis of just about every online free dictionary, and referred to by people who don't even know that PG exists. And another major point is, what do DPers actually want to work on? Hard material tends to go through slowly, where as junk fiction tends to go through pretty quickly. That has nothing to do with the popularity or worthiness of the text. We could toss out a bunch of the "less worthy" books in exchange for the OED or porn, but I doubt that will increase DP production overall. -- Kie ekzistas vivo, ekzistas espero. From hart at pglaf.org Wed Mar 3 17:59:35 2010 From: hart at pglaf.org (Michael S. Hart) Date: Wed, 3 Mar 2010 17:59:35 -0800 (PST) Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> <6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com> Message-ID: Sorry, but lots of libraries are doing JUST that!!! Selling the books after digitizing. . . . I bought several volumes of the NY Herald when this was done. I will probably buy more. On Wed, 3 Mar 2010, David Starner wrote: > On Wed, Mar 3, 2010 at 1:24 PM, Jim Adcock wrote: > >?I suspect when Google digitizes the book the > > original is then trashed by the college library > > That would be silly. When you have endowments the size of the Harvard > University, you have no need to do that; since they're already built > book-storage buildings where books can stored in more space-efficient > forms than browseable stacks and are retrievable only with a couple > days' notice, you're simply more free to exile them there. > > > -- the whole point being > > they do not want to have to pay to maintain physical library books in > > various states of decay. > > The whole point of this thing is that Google thought that digitalizing > this material would be valuable, and the universities all thought that > it would be valuable to have digital copies of their collection, and > that it would further their mission to spread knowledge. > > > ?Google then becomes the sole repository for this > > information > > No. The universities all have copies of all the scans made from their books. > > > But to my mind one measure is > > obvious: ?Books that real people do not in practice want to read we should > > not bother to restore! > > Then we aren't doing enough porn. If your sole measure of worthiness > is the number of hits, then forget about doing the works of Sarah Orne > Jewett, let's start digging up all that erotica published in the 20s > and 30s under the table and watch the Google hits come flying it. > > > As a simple measure at least the total > > amount of time people spend reading the book has to exceed the amount of > > time volunteers spend preparing the book, or it's a loss to society. > > It's not a loss to society to take time that would be used for > watching TV and use it to restore books. It's not a loss to society if > we make a work accessible to the right scholar, or if we inspire the > right person. > > > But it is trivial to find a book to work on that will be > > 50X more popular than the average book DP finishes. > > First, looking at the puerile crap (no offense intended) that comes up > as done by you, I'm not sure you can find it. The first Slashdotting > of DP, someone complained that among the little material we had > available was my scan of "From October to Brest-Litovsk", but to this > day, I think that book--history written with lightning--was one of the > more important works I did, and probably more read too (someone did it > for Librivox). > > In some sense, the single most popular work PG has has to be the 1913 > Webster's, which has been borrowed as the basis of just about every > online free dictionary, and referred to by people who don't even know > that PG exists. > > And another major point is, what do DPers actually want to work on? > Hard material tends to go through slowly, where as junk fiction tends > to go through pretty quickly. That has nothing to do with the > popularity or worthiness of the text. We could toss out a bunch of the > "less worthy" books in exchange for the OED or porn, but I doubt that > will increase DP production overall. > > -- > Kie ekzistas vivo, ekzistas espero. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From dakretz at gmail.com Wed Mar 3 18:04:57 2010 From: dakretz at gmail.com (don kretz) Date: Wed, 3 Mar 2010 18:04:57 -0800 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> <6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com> Message-ID: <627d59b81003031804s6f679594hbe7a3ed26be58ac8@mail.gmail.com> And another major point is, what do DPers actually want to work on? Hard material tends to go through slowly, where as junk fiction tends to go through pretty quickly. That has nothing to do with the popularity or worthiness of the text. We could toss out a bunch of the "less worthy" books in exchange for the OED or porn, but I doubt that will increase DP production overall. This is at least due to the urgency DP places on moving people out of their comfort zone. New people at every level are encouraged to choose (naturally enough) easy projects to climb the learning curve; and since virtually everyone is being encouraged to advance, this material comprises a larger portion than it would otherwise. To assist this, easy projects are released from the queues more quickly (again to encourage new skills). I've mentioned that no Shakespeare play has been released into F2 or processed into PG for several years, despite sitting in the F2 queue much of that time. -------------- next part -------------- An HTML attachment was scrubbed... URL: From donovan at abs.net Wed Mar 3 18:21:49 2010 From: donovan at abs.net (D Garcia) Date: Wed, 3 Mar 2010 21:21:49 -0500 Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not listen, pray) In-Reply-To: References: <30a2.40c90764.38bf8822@aol.com> <4B8ED97E.5020105@novomail.net> Message-ID: <201003032121.50581.donovan@abs.net> Jim/James: re: >... I submitted two books in good faith to DP which are now >stuck there indefinitely after I contributed many many hours of my own time Of your two projects, the first went from creation to completion of all rounds solely through the normal operation of the queues in two months, and is being post-processed/verified. If you have concerns as PM about specifics of the project or its status, you should contact the post-processor or the PP-verifier of the project via the several means available to you on the DP site. The second project has also gone from creation to completion of P1, P2, P3 and F1 in two months, also solely through the normal operation of the queues, and has been waiting in F2 for seven months. In the normal operation of the queues, this project would release about six weeks from now. That's pretty far from "indefinitely." I have taken the liberty of releasing this project into F2 where a group of F2 volunteers are focusing their efforts on it and will easily complete it before day's end, possibly before this post reaches the list. re: >... and I have no way to get those books back out. Since you are the project manager, you could have assigned yourself as post- processor and requested that it skip F2. However, as the F2'ers are finding and correcting formatting and other errors, it's probably better that you didn't. Project managers have options within the DP process, including, but not limited to those mentioned above, either of which could have progressed your project. Any of the DP project facilitators, db-req, dp-help, or admins could have heard your concerns and and discussed options with you, which could have saved much of the frustration which you've expressed on this list, had you only asked. David (donovan) James Adcock wrote: >"My problem", if we have to talk about my problems of which there >are many, is that I submitted two books in good faith to DP which are now >stuck there indefinitely after I contributed many many hours of my own time >and tears, and I have no way to get those books back out. From schultzk at uni-trier.de Thu Mar 4 01:36:41 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Thu, 4 Mar 2010 10:36:41 +0100 Subject: [gutvol-d] Re: DP/PG vs. Google In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> Message-ID: <7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de> Hi All, I step in here and reply to a couple posts at one time. Am 03.03.2010 um 20:42 schrieb Pascal J. Bourguignon: > > On 2010-03-03, at 19:24, Jim Adcock wrote: >> However, The DP/PG approach is extremely expensive compared to what Google >> is doing. Consider: Google Books == about 10 million books photo scanned. >> DP/PG == 30,000 books "fully restored." So Google's approach is about 300X >> faster than the DP/PG approach. My Conclusion: In the best of all world's >> there would be some measure of VALUE in choosing which books DP/PG chooses >> to put effort into fully restoring -- the idea that somehow DP/PG is going >> to be able to fully restore all the world's books is surely false. Google produces scan sets. Sure they put some in amore pleasurable form, But they are not interested producing books or even conserving them. The quality of the work is proof in fact. My personal opinion is that Google is simply interested in producing revenue, by what even means! That does not mean that Google does not have any merit. DP want to produce pleasurable eBooks. Personally, DP/PG has more value. > I think that the bet made by Google, is that sooner or later, sufficiently > smart AI and OCR technology will be developed to allow to process its scans > and do the job of PG automatically. I doubt this very much. AI proper has been dead since the failing of the ELISA project. Yes, the term is still used today to refer to anything that a computer does that seems to be intelligent. But it is hardly AI. In the 80s machine translation was all the fad. The japanese said they would have a MT system that would translate in real time your telephone conversations by the 90s. Well, here we are some 20 years later and we can have the most horrific translation made online. The standard is that of my introductory class I had in the 80s. Googles service does not even use the half of the developments made MT. > > The only question is when it will happen, and some think that singularity > will occur within 20 years. BB if it was realistic I would take you up on your bet. In 50 years their will not be a system finished that will do job of creating proper output anything above 95% fully automatically That is without any human interaction whatsoever.. Already, in the 90s it was said that with faster computers and cheaper storage the problems of knowledge engineering. Again, here we are and all is vaporware. It has been proven already in the 80s that human language Type 0 is and it is know that Type 0 can not be processed completely automatically by a computer. So the emphasis has change to simulate as much as possible. Yet, this will be always far from prefect. Sorry, for being here more than OT. But it was needed to prove the fact that anything having to do with language can not be handle by a computer program by itself. regards Keith. From hart at pglaf.org Thu Mar 4 07:02:07 2010 From: hart at pglaf.org (Michael S. Hart) Date: Thu, 4 Mar 2010 07:02:07 -0800 (PST) Subject: [gutvol-d] !@! I Take That Bet! Re: Re: DP/PG vs. Google In-Reply-To: <7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> <7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de> Message-ID: BB if it was realistic I would take you up on your bet. In 50 years their will not be a system finished that will do job of creating proper output anything above 95% fully automatically That is without any human interaction whatsoever.. _I_ will take that bet!!! Even thought there are no realistic odds I will be here to collect. I will be only too glad to have the proceeds go to PG, or In Memoriam. The bet is that a Xerox machine type of scanning and OCR will produce a 95% accurate copy of certain pages selected from an average set of books, magazines, etc. Just go to a library and ask for samples. Fair enough??? Michael From Bowerbird at aol.com Thu Mar 4 08:00:40 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 4 Mar 2010 11:00:40 EST Subject: [gutvol-d] Re: =?iso-8859-1?q?!=40!_I_Take_That_Bet!_Re=3A=A0_Re=3A_DP/PG_vs?= =?iso-8859-1?q?=2E_Google?= Message-ID: <5b5e0.56a961e8.38c13328@aol.com> michael said: > The bet is that a Xerox machine type of scanning and OCR > will produce a 95% accurate copy of certain pages selected > from an average set of books, magazines, etc.? > Just go to a library and ask for samples. that's not the bet at all. the bet is whether google can increase accuracy to 96% or 97%. we're not talking about the limits of scanning and o.c.r., people. we're talking about what a company with virtually unlimited funds and lots and lots and lots and lots of expertise with handling text can do _after_ they've scanned books and done o.c.r. on the scans, in order to improve the accuracy of that text. folks, we're talking about how well they can clean up their o.c.r. and i'm conservative by saying 96% or 97%... quite conservative. i've shown how useful it can be to compare two book digitizations. but for some editions of some books, google will have _many_ different digitizations, involving different physical copies taken from different physical libraries throughout the country, scanned by different machines, and perhaps processed using different o.c.r. they will certainly experiment with despeckling and resolution, and other variables, and should hit on a comparison combination which -- for their particular scans -- works remarkably effectively. they will also have tons of data on the types of errors that are made by their equipment, and knowing that _will_ help them fix the errors. but mostly just having _multiple_digitizations_ of the same edition of a book gives them the chance to raise accuracy through the roof. you guys want to tie google's hands in the same way yours are tied. but google's money and expertise mean they are _miles_ ahead... and eventually probably even light-years ahead... -bowerbird p.s. and the limitation on the bet that google can't use humans? why not? they have billions of pageviews every single day, not? why do you think they bought recaptcha and hired luis von ahn? they're not limited by the shackles that you want them to wear... -------------- next part -------------- An HTML attachment was scrubbed... URL: From lee at novomail.net Thu Mar 4 08:50:17 2010 From: lee at novomail.net (Lee Passey) Date: Thu, 04 Mar 2010 09:50:17 -0700 Subject: [gutvol-d] Re: roundlessness -- 010 In-Reply-To: <34f8a.2b48e537.38c047e9@aol.com> References: <34f8a.2b48e537.38c047e9@aol.com> Message-ID: <4B8FE4C9.5090605@novomail.net> On 3/3/2010 4:16 PM, Bowerbird at aol.com wrote: [snip] > the idea is that the proofer does the word-by-word scan > _not_ against a web-page's textfield version of the page, > but rather against an .html-realized version of the page... ... > the main benefit is that you free 'em from having to look > at the markup, because that's an unnecessary distraction. > they see actual rendered italics, not the markup for italics. FWIW, I like this idea very much. I think it meshes quite nicely with Mr. Frank's notion that markup should never be separated from its associated text. > it's also possible this way to red-flag any possible scannos, > as well as capitalization and punctuation improbabilities... > > you can also colorize quotations, which helps locate any > missing or incorrect quotemarks. This would have to be done subtly, so as not to influence users to make changes where they are inappropriate, but I think the idea has merit. [snip] > then, only if there are changes to be made will the proofer > summon the textfield for editing. The Kupu editor which is part of the Plone and Apache Lenya projects could be a very nice choice for this editor (good software engineers never want to reinvent wheels). In fact, I think it may be possible to use Lenya to build a prototype of this very sort of application. I'll do a little research and get back to you. From Bowerbird at aol.com Thu Mar 4 09:18:58 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 4 Mar 2010 12:18:58 EST Subject: [gutvol-d] let them eat cake Message-ID: <61369.187e84e6.38c14582@aol.com> michael said: > Sorry, but lots of libraries are doing JUST that!!! > Selling the books after digitizing. . . . > I bought several volumes of the NY Herald when this was done. > I will probably buy more. i'm afraid this situation is already bad, and will certainly get worse. universities nationwide are coming into a huge money crunch and it costs a lot of money to house books. and the benefits of doing so are becoming ever less clear, since usage has dropped precipitously. (ends up if the kids can't get it online, they won't trudge to a library, they'll just do without. and not just the kids, but the _faculty_ too!) so yes, universities are building book warehouses for collections. (the u.c. southern regional library facility is on-campus at u.c.l.a.) but there's even more to the story. some google library partners like _michigan_ are forming co-ops (e.g., the hathi trust) that will offer scansets to other institutions. right now, obviously, they're aiming at colleges and universities, but it's fairly clear they will soon target research institutions and private schools, public schools in big cities, and big city libraries. every entity that is now funding a library (which is probably quite limited in scope and fairly expensive to maintain) will soon find they can instead get access to a much bigger corpus of material for a much cheaper price by subscribing to these rent-a-libraries. so they will all get rid of their paper-books. (can't afford both!) so the problem is not just the libraries where scans are made, but every single library across the entire country. now, in an ideal world, that would be great!, because we would all agree that everyone should have unlimited access to this library, just like they have unlimited access to their neighborhood library. but that's not what the moneychangers have in mind, no siree... this isn't a chance for society to save money. to the contrary, it's a way for the moneychangers to rob society. in addition to the fee they'll extract from each overall institution, they'll likely want to charge a fee to each individual user as well, perhaps even a per-page fee for every page every user views... some of you might think that that would be totally reasonable. you're mistaken. you're badly mistaken. you're very badly mistaken. the reason you're mistaken is that sharing these scans is a process that has very little variable cost. most of the costs were fixed costs. scanning, for instance, was a fixed cost, and a one-time cost at that. by the time these scans have been pushed out a dozen times, they will have paid their fixed costs... everything after that, and there'll be much usage after that, will be profit, pure profit, _excess_ profit. the moneychangers want to get paid over and over and over again. when you consider that these books were purchased and housed at _public_ expense -- some of 'em for well over a century -- this profiteering against the public's pocket is totally unconscionable. still, librarian bureaucrats have proven time and time and time again that they're complete idiots who will cut their own throats long-term to get even a questionable good in the short-term. so they will play along into this little con-game being played by the moneychangers, and the public will once again be left holding the bag of bills to pay... and the final upshot? we'll pay even more for books than we pay now, and the poor among us will find that their access is sharply curtailed... but, hey, what do poor people need books for anyway? let 'em eat cake. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Thu Mar 4 10:02:14 2010 From: jimad at msn.com (Jim Adcock) Date: Thu, 4 Mar 2010 10:02:14 -0800 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <627d59b81003031804s6f679594hbe7a3ed26be58ac8@mail.gmail.com> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> <6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com> <627d59b81003031804s6f679594hbe7a3ed26be58ac8@mail.gmail.com> Message-ID: Well, I guess I should stop complaining now because one of my DP texts has made it to PP and I was able to snag it back myself. But, I will point out its statistics on the latest round, and people can judge for themselves: This book sat on the F2 queue for 7,200 hours. It then went "live" in F2 status for 3 hours, which is how long it took 14 F2 volunteers to do all the pages. Since about 3 volunteers were working on the book at any given time, the total volunteer-hours spent on F2 was about 10. So the ratio of [time sitting on queue]/[volunteer-hours working on text] is about 700 to one. Is this a well-designed system? PS: This book WAS classified as "porn" when it first came out -- which may explain WHY the volunteers are interested in tackling it. I did tag it as containing material related to sexuality and infidelity in case anyone didn't want to work on those subjects. Nowadays the "porn" label would be a joke and the book is considered a classic of modern American literature. In defense of the DP volunteers the other book I have stuck in DP was tackled even more voraciously by DP volunteers -- and that one was never considered "porn." >Hard material tends to go through slowly, where as junk fiction tends to go through pretty quickly. Material can be hard AND junk. I am perfectly happy to work on hard stuff if it will actually get used by anyone. I spent some time proofing a hard book on DP [that was labeled "Easy"] that should have been titled "How to Torture a Horse." Put up OED and I will help tackle it. I would also be happy to put up "Outline of Science Vol. II" which is hard AND popular -- if DP were willing to get it out the door in say a year or less. >I've mentioned that no Shakespeare play has been released into F2 or processed into PG for several years, despite sitting in the F2 queue much of that time. And I would also be willing to work on the bard. Again, I won't be one processing him *into* DP unless I have some assurance that he's ever going to come *out* again! From Bowerbird at aol.com Thu Mar 4 10:09:32 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 4 Mar 2010 13:09:32 EST Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts Message-ID: <64a88.4b99698d.38c1515c@aol.com> jim said: > Well, I guess I should stop complaining now why? i thought your complaints were made on a principle. you were just bellyaching for a speed-up exception favor? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Thu Mar 4 10:25:34 2010 From: jimad at msn.com (Jim Adcock) Date: Thu, 4 Mar 2010 10:25:34 -0800 Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not listen, pray) In-Reply-To: <201003032121.50581.donovan@abs.net> References: <30a2.40c90764.38bf8822@aol.com> <4B8ED97E.5020105@novomail.net> <201003032121.50581.donovan@abs.net> Message-ID: >Of your two projects, the first went from creation to completion of all rounds solely through the normal operation of the queues in two months, and is being post-processed/verified. If you have concerns as PM about specifics of the project or its status, you should contact the post-processor or the PP-verifier of the project via the several means available to you on the DP site. I have certainly done so, and have been told that it is "normal" for a PP to take a long time at DP and that it would not be nice to keep asking the PP every three months or so "how's it going." >Any of the DP project facilitators, db-req, dp-help, or admins could have heard your concerns and and discussed options with you, which could have saved much of the frustration which you've expressed on this list, had you only asked. I did ask, and I was told that there was nothing I could do to expedite the process and that these delays are normal, and what I should do is spend my time and energy sticking more projects into the front end of the queue. From jimad at msn.com Thu Mar 4 10:34:09 2010 From: jimad at msn.com (Jim Adcock) Date: Thu, 4 Mar 2010 10:34:09 -0800 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <64a88.4b99698d.38c1515c@aol.com> References: <64a88.4b99698d.38c1515c@aol.com> Message-ID: >jim said: >> Well, I guess I should stop complaining now > >why? i thought your complaints were made on a principle. >you were just bellyaching for a speed-up exception favor? ...said tongue in cheek. I haven't stopped complaining that the system really doesn't work the way it's currently designed. Although I don't see why pointing out that the system doesn't work as designed should be considered "bellyaching" any more than telling a webmaster that their server is down is considered "bellyaching"! Agreed that the phone company considers that I am "bellyaching" when I call them to say that my phone service isn't working and then I tell them that *their* phone service isn't working either! [which typically takes about three hours because their queuing systems don't work either....] From marcello at perathoner.de Thu Mar 4 10:50:44 2010 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu, 04 Mar 2010 19:50:44 +0100 Subject: [gutvol-d] Re: !@! I Take That Bet! Re: Re: DP/PG vs. Google In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> <7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de> Message-ID: <4B900104.2020702@perathoner.de> Michael S. Hart wrote: > > BB if it was realistic I would take you up on your bet. In 50 years their will > not be a system finished that will do job of creating proper output anything > above 95% fully automatically That is without any human interaction whatsoever.. > > _I_ will take that bet!!! > > Even thought there are no realistic odds I will be here to collect. > > I will be only too glad to have the proceeds go to PG, or In Memoriam. > > The bet is that a Xerox machine type of scanning and OCR will produce > a 95% accurate copy of certain pages selected from an average set of > books, magazines, etc. Just go to a library and ask for samples. Accuracy of OCR already exceeds 99%. Send me the money. -- Marcello Perathoner webmaster at gutenberg.org From lee at novomail.net Thu Mar 4 10:57:18 2010 From: lee at novomail.net (Lee Passey) Date: Thu, 04 Mar 2010 11:57:18 -0700 Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not listen, pray) In-Reply-To: <201003032121.50581.donovan@abs.net> References: <30a2.40c90764.38bf8822@aol.com> <4B8ED97E.5020105@novomail.net> <201003032121.50581.donovan@abs.net> Message-ID: <4B90028E.4020103@novomail.net> On 3/3/2010 7:21 PM, D Garcia wrote: [snip] > I have taken the liberty of releasing this project into F2 where a group of F2 > volunteers are focusing their efforts on it and will easily complete it before > day's end, possibly before this post reaches the list. Man, you've got to love it! Mr. Adcock points out that the production process at Distributed Proofreaders is broken, and offers a sample demonstrating /how/ it is broken. In response, Mr. Garcia removes the sample from the standard process and deals with it as a special case. In other words, instead of trying to fix the broken process, Mr. Garcia has simply tried to neutralize the complaint! I've worked at a number of few different companies for which this was just Standard Operating Procedure ... most of whom are no longer with us. From marcello at perathoner.de Thu Mar 4 11:02:24 2010 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu, 04 Mar 2010 20:02:24 +0100 Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> <6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com> <627d59b81003031804s6f679594hbe7a3ed26be58ac8@mail.gmail.com> Message-ID: <4B9003C0.9050106@perathoner.de> Jim Adcock wrote: > PS: This book WAS classified as "porn" when it first came out -- which may > explain WHY the volunteers are interested in tackling it. Personally I'd like to see more porn on PG, we still lack most of De Sade. -- Marcello Perathoner webmaster at gutenberg.org From lee at novomail.net Thu Mar 4 11:21:06 2010 From: lee at novomail.net (Lee Passey) Date: Thu, 04 Mar 2010 12:21:06 -0700 Subject: [gutvol-d] Re: !@! I Take That Bet! Re: Re: DP/PG vs. Google In-Reply-To: <4B900104.2020702@perathoner.de> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> <7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de> <4B900104.2020702@perathoner.de> Message-ID: <4B900822.9070006@novomail.net> On 3/4/2010 11:50 AM, Marcello Perathoner wrote: > Michael S. Hart wrote: [snip] >> The bet is that a Xerox machine type of scanning and OCR will produce >> a 95% accurate copy of certain pages selected from an average set of >> books, magazines, etc. Just go to a library and ask for samples. > > Accuracy of OCR already exceeds 99%. Absolutely. According to what I learned in typing class (yes, I really am that old) a standard typewritten sheet of paper averages 72 lines of 66 characters each, resulting in 4752 characters per page. Based solely on a per character basis 99% accuracy would allow 47 errors per page. Modern OCR, even that POS that IA uses, gives better accuracy than that. If you choose to look at words instead of characters, it is generally accepted that the average word length is 6 characters, for an average of 9.5 words per line (I have omitted spaces which is why it is not 11 words per line). This results in an average of 679 words per page, which at 99% accuracy would allow for 6 misrecognized /words/ per page. That is still well within the recognition accuracy of modern OCR. Personally, I find bowerbird's stated goal of 1 error per 10 pages a worthwhile goal. This is actually an accuracy rate (based upon words) of 99.9998527%. So maybe the bet ought to be when automated OCR will exceed four 9s of accuracy (basically one word error per page). Some of the recent work I have done, from my own scans, already reaches that threshold. (Accuracy will, of course, vary depending on the quality of the scanned image. YMMV and all that jazz.) From hart at pglaf.org Thu Mar 4 12:13:37 2010 From: hart at pglaf.org (Michael S. Hart) Date: Thu, 4 Mar 2010 12:13:37 -0800 (PST) Subject: [gutvol-d] Re: !@! I Take That Bet! Re: Re: DP/PG vs. Google In-Reply-To: <4B900104.2020702@perathoner.de> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> <7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de> <4B900104.2020702@perathoner.de> Message-ID: Marcello don't you ever READ anything before replying???!!! Still???!!! "In 50 years there will NOT be as system. . .above 95%. . ." I took that bet, betting there WILL be. . .SHOW ME THE MONEY!!! How do you expect anyone to EVER take you seriously when you do this kind of thing over, and over, and over. . .???!!! On Thu, 4 Mar 2010, Marcello Perathoner wrote: > Michael S. Hart wrote: > > > > BB if it was realistic I would take you up on your bet. In 50 years their > > will > > not be a system finished that will do job of creating proper output anything > > above 95% fully automatically That is without any human interaction > > whatsoever.. > > > > _I_ will take that bet!!! > > > > Even thought there are no realistic odds I will be here to collect. > > > > I will be only too glad to have the proceeds go to PG, or In Memoriam. > > > > The bet is that a Xerox machine type of scanning and OCR will produce > > a 95% accurate copy of certain pages selected from an average set of > > books, magazines, etc. Just go to a library and ask for samples. > > Accuracy of OCR already exceeds 99%. > > Send me the money. > > > From schultzk at uni-trier.de Thu Mar 4 12:37:29 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Thu, 4 Mar 2010 21:37:29 +0100 Subject: [gutvol-d] Re: !@! I Take That Bet! Re: Re: DP/PG vs. Google In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> <7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de> Message-ID: One big problem, You dio not stil a a PG or DP text ebook. You do have any markup what so even! Plus, what happens if you give them the Google scan sets!! I have work with OCR that will get me 100% text accuracy, but it took a hell alot of training, aka human interaction. Also, OCR today achieves their accuracy from dictionaries and guessing at the correct spelling. Which under many circumstances this type of heuristics causes a quite a few errors. regards Keith. Am 04.03.2010 um 16:02 schrieb Michael S. Hart: > > > BB if it was realistic I would take you up on your bet. In 50 years their will > not be a system finished that will do job of creating proper output anything > above 95% fully automatically That is without any human interaction whatsoever.. > > _I_ will take that bet!!! > > Even thought there are no realistic odds I will be here to collect. > > I will be only too glad to have the proceeds go to PG, or In Memoriam. > > The bet is that a Xerox machine type of scanning and OCR will produce > a 95% accurate copy of certain pages selected from an average set of > books, magazines, etc. Just go to a library and ask for samples. > > Fair enough??? > > > Michael > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d From donovan at abs.net Thu Mar 4 12:41:18 2010 From: donovan at abs.net (D Garcia) Date: Thu, 4 Mar 2010 15:41:18 -0500 Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not listen, pray) In-Reply-To: <4B90028E.4020103@novomail.net> References: <30a2.40c90764.38bf8822@aol.com> <201003032121.50581.donovan@abs.net> <4B90028E.4020103@novomail.net> Message-ID: <201003041541.18286.donovan@abs.net> Lee Passey wrote: >Mr. Adcock points out that the production process at Distributed >Proofreaders is broken, and offers a sample demonstrating how it is >broken. In response, Mr. Garcia removes the sample from the standard >process and deals with it as a special case. In other words, instead of >trying to fix the broken process, Mr. Garcia has simply tried to >neutralize the complaint! It's telling that based on zero knowledge you first assume (wrongly) that I am not working on improving the DP process, and then compound the error by assuming that addressing a volunteers issue constitutes "neutralizing" a complaint, all the while ignoring the rest of the message which outlined the full situation instead of the narrowly spun perspective you present. I'm sorry you believe that DP has nefarious intent in responding to a situation where a volunteer believed they had no recourse. Since I can't believe that you think ignoring that issue would have somehow been better, I am forced to conclude that your only concern is the spin you've tried to put on it. Congratulations on winning a bet for me that someone would attempt to do exactly that. :) David (donovan) From dakretz at gmail.com Thu Mar 4 13:07:54 2010 From: dakretz at gmail.com (don kretz) Date: Thu, 4 Mar 2010 13:07:54 -0800 Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not listen, pray) In-Reply-To: <201003041541.18286.donovan@abs.net> References: <30a2.40c90764.38bf8822@aol.com> <201003032121.50581.donovan@abs.net> <4B90028E.4020103@novomail.net> <201003041541.18286.donovan@abs.net> Message-ID: <627d59b81003041307o2461de16ye52446d30fa19b51@mail.gmail.com> I will in this case vouch for at least part of the representation given by David (donovan). What you experienced is in fact the primary method employed by the DP process managers for trying to ameliorate the consequences of their system. When a perceived deficiency is detected, it is defined as a "special case" and given "special treatment". So your first project probably qualifies as a "First Project", and therefore has access to a good deal of standard "special treatment" that you might not have been aware of (though it was your responsibility to be so, unfortunately.) Your second project may have also been qualified for another standard "special treatment"; I'm not very familiar with all the nuances, but put he certainly is - as he points out, he is one of those primarily responsible for it. (In tact, it's also true that he is one of the primary gatekeeepers for innovation and process improvement generally.) It's too bad you had the misfortune to be advised by someone not familiar with the proper navigation of the dp process. As is apparently also the case for whoever is responsible for the Shakespeare projects. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Thu Mar 4 14:13:40 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 4 Mar 2010 17:13:40 EST Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts Message-ID: <75210.438f65cb.38c18a94@aol.com> jim said: > Although I don't see why pointing out that the system > doesn't work as designed should be considered "bellyaching" ...also said tongue in cheek... ;+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Thu Mar 4 15:10:32 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 4 Mar 2010 18:10:32 EST Subject: [gutvol-d] Re: =?iso-8859-1?q?!=40!_I_Take_That_Bet!_Re=3A=A0_Re=3A_DP/PG_vs?= =?iso-8859-1?q?=2E_Google?= Message-ID: <78ae8.531a7bed.38c197e8@aol.com> keith said: > One big problem, > You dio not stil a a PG or DP text ebook. there seems to be a transcription difficulty there... that's ok, it happens even to the human among us. > You do have any markup what so even! who needs markup? o.c.r. can manage italics, not so well, agreed, but still. as for the structural aspects of a text, like chapter-heads and block-quotes and stuff like that, google is already showing that they are capable of handling such things... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From lee at novomail.net Thu Mar 4 15:15:12 2010 From: lee at novomail.net (Lee Passey) Date: Thu, 04 Mar 2010 16:15:12 -0700 Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not listen, pray) In-Reply-To: <201003041541.18286.donovan@abs.net> References: <30a2.40c90764.38bf8822@aol.com> <201003032121.50581.donovan@abs.net> <4B90028E.4020103@novomail.net> <201003041541.18286.donovan@abs.net> Message-ID: <4B903F00.3050704@novomail.net> On 3/4/2010 1:41 PM, D Garcia wrote: > I'm sorry you believe that DP has nefarious intent in responding to a > situation where a volunteer believed they had no recourse. I would never suggest than anyone at DP has nefarious intent without incontrovertible evidence, and I do not do so now; I'm a firm believer in Hanlon's razor. > Since I can't > believe that you think ignoring that issue would have somehow been better, I > am forced to conclude that your only concern is the spin you've tried to put > on it. First of all, I don't think it is the responsibility of anyone at Distributed Proofreaders to make sure that any particular volunteer is satisfied. I'm confident that Mr. Adcock is a competent producer of e-texts, and if you were to have told him, "look, we're doing the best we can here, but if you want to pick up the project on your own here's where you can get everything we've done up until now," that would have been sufficient. However, I certainly don't believe that ignoring the issue would have been a better option; instead I believe that addressing the issue head-on would have been better. There is an adage in Washington that "sunshine is the best disinfectant." Likewise, I believe that transparency is the best defense. First of all, we must recognize that the problem that Mr. Adcock was complaining of was /not necessarily/ that two of his projects remained in the Post-Processing queue for an unduly long time. Rather the problem he identified is that somehow the current production processes allow /any and all/ projects to become backed up in that queue. Given that problem statement, I would have liked to have seen something more like one of the following responses: 1. "There are no problems with the processes at Distributed Proofreaders. If you don't like the way we do things here, you don't have to participate." or, 2. "We recognize that there is a problem but we can't seem to agree upon the cause. We'll keep you informed as to the results of our inquiry. In the meantime, here's where you or the public-at-large can retrieve /all/ of the pieces of the stuck projects so you can take one and move it forward outside of the aegis of DP if you like." or, 3. "We recognize that there is a problem in our production process and we think we have identified the cause, which is [fill in the cause here]. As of yet we have not agreed on the best way to reform the process, but we'll keep you informed as to our progress. In the meantime, here's where you or the pubic-at-large, etc..." or, 4. "We have identified a problem in our production process, and believe it can be resolved by [fill in the proposed resolution here]. Please be patient while we see if this proposal resolves the backlog. If it does not, we will resume our search for the underlying problem, and in the meantime, here's where you or the public-at-large, etc..." Instead we saw a response more along the lines of: "We can not confirm or deny the existence of any problems in the production processes of Distributed Proofreaders, nor can we confirm or deny that we have identified any of the causes for these problems which may or may not exist. We may or may not have agreed upon what may or may not be a solution to these unidentified, alleged problems, but there is a possibility that we might change our process in unspecified ways. Or not. But as a special favor to you we will extract the two projects you are interested in to route around the damage, which may or may not exist, and process them using an entirely different procedure so that you will be satisfied." It seems to me that this kind of response is designed, in fact, to ignore the issue at hand, which is that changes need to be made at D.P. to increase the throughput of e-texts. Now it very well may be that this problem has already been recognized by The Powers That Be, and that a solution will be in place Real Soon Now. In that case, wouldn't it have been better to just say so? > Congratulations on winning a bet for me that someone would attempt to do > exactly that. :) I'm always happy to help. ;-) From sly at victoria.tc.ca Thu Mar 4 15:23:53 2010 From: sly at victoria.tc.ca (Andrew Sly) Date: Thu, 4 Mar 2010 15:23:53 -0800 (PST) Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: <4B9003C0.9050106@perathoner.de> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> <6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com> <627d59b81003031804s6f679594hbe7a3ed26be58ac8@mail.gmail.com> <4B9003C0.9050106@perathoner.de> Message-ID: On Thu, 4 Mar 2010, Marcello Perathoner wrote: > Jim Adcock wrote: > > > PS: This book WAS classified as "porn" when it first came out -- which may > > explain WHY the volunteers are interested in tackling it. > > Personally I'd like to see more porn on PG, we still lack most of De Sade. > I recall seeing something just recently, as I was looking up author names... it's in German too... Here we go: Josefine Mutzenbacher http://www.gutenberg.org/etext/31284 Looks like it came through DP. --Andrew From Bowerbird at aol.com Thu Mar 4 15:37:44 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 4 Mar 2010 18:37:44 EST Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not listen, pray) Message-ID: <7a9ea.45b89c9.38c19e48@aol.com> dkretz said: > What you experienced is in fact the primary method > employed by the DP process managers for > trying to ameliorate the consequences of their system. > When a perceived deficiency is detected, it is > defined as a "special case" and given "special treatment". it's actually become quite humorous to see all the efforts to "route around the damage" caused by the bad workflow. they're repeating rounds, skipping rounds, limiting people, cajoling people, it's a cavalcade of exceptions to the rule... and hey, it makes perfect sense to "make someone happy" (i.e., shut them up) if you can do it by doing them a favor. meanwhile, if you're not a "special case" and you don't get "special treatment", you'd better enjoy the end of the line. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From donovan at abs.net Thu Mar 4 18:07:42 2010 From: donovan at abs.net (D Garcia) Date: Thu, 4 Mar 2010 21:07:42 -0500 Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not listen, pray) In-Reply-To: <4B903F00.3050704@novomail.net> References: <30a2.40c90764.38bf8822@aol.com> <201003041541.18286.donovan@abs.net> <4B903F00.3050704@novomail.net> Message-ID: <201003042107.43123.donovan@abs.net> Lee Passey wrote: >... Rather the problem [Jim] identified is that somehow the current > production processes allow /any and all/ projects to become backed up in > that queue. Given that problem statement, I would have liked to have seen > something more like one of the following responses: [Lee's four hypothetical responses omitted] >Instead we saw a response more along the lines of: [Paraphrase of my actual response omitted] I addressed Jim's issue here solely because gutvol-d is where he raised it. >It seems to me that this kind of response is designed, in fact, to >ignore the issue at hand, which is that changes need to be made at D.P. >to increase the throughput of e-texts. I don't believe that there is anyone at DP at any level of participation who is unaware of the need for improvements in the process. However, the variously proposed "solutions" run the gamut from the obviously naive/simplistic, through horribly manual kludges, all the way up to byzantine complexities requiring considerable effort from the entire volunteer base. Your statement even reflects a common barrier to getting a grip on the issues: "to increase the throughput of e-texts" is actually a statement of goal. While increasing the throughput of the process should be and is a component of DP's long-term goals, the specific problems and their underlying causes need to be identified first in order to effectively address them. >Now it very well may be that this problem has already been recognized by >The Powers That Be, and that a solution will be in place Real Soon Now. >In that case, wouldn't it have been better to just say so? I'm certain some of these problems have been identified and the underlying causes and potential solutions are being examined, but I'm equally certain that no consensus can be achieved within the DP community as to what the causes are, much less what solutions are feasible, achievable, or even desirable. Whatever solutions do eventually result, interim or otherwise, some fraction of volunteers at DP will disagree. By no means am I ignoring the broader issues at DP--but I feel those are usually better discussed in a more appropriate venue. Overall, experience has shown that any discussion of DP on gutvol-d generally and unfortunately serves little productive purpose. While positive and insightful comments do occur, (and are read and appreciated!), they are easily lost in the background of posts which far too often contain derision, belittlement, accusation, and misrepresentation. One almost wonders whatever happened to basic respect. But then I remember the synergistic relationship between media and popular culture (of which old books are an excellent reminder). :) David (donovan) From jimad at msn.com Thu Mar 4 18:28:08 2010 From: jimad at msn.com (James Adcock) Date: Thu, 4 Mar 2010 18:28:08 -0800 Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not listen, pray) In-Reply-To: <201003042107.43123.donovan@abs.net> References: <30a2.40c90764.38bf8822@aol.com> <201003041541.18286.donovan@abs.net> <4B903F00.3050704@novomail.net> <201003042107.43123.donovan@abs.net> Message-ID: >I addressed Jim's issue here solely because gutvol-d is where he raised it. I also raised the issue on two DP forums which were discussing the issue, where my points have been discussed, with less heat generated perhaps, but also generating less light, and certainly no less action. Sorry to find these issues are still so controversial in NFPs -- these issues were controversial in industry when Deming first applied them to Japan quality issues in the 1950s, and again in US industries in the 1980s -- nowadays these principles are almost universally applied: JIT means no investment locked up unused, and no place for a LACK of quality to hide. JIT also keeps people busy rather than idled. From prosfilaes at gmail.com Thu Mar 4 18:58:28 2010 From: prosfilaes at gmail.com (David Starner) Date: Thu, 4 Mar 2010 21:58:28 -0500 Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts In-Reply-To: References: <64a88.4b99698d.38c1515c@aol.com> Message-ID: <6d99d1fd1003041858i252eddeev1a29eb697d834482@mail.gmail.com> On Thu, Mar 4, 2010 at 1:34 PM, Jim Adcock wrote: > Although I don't see > why pointing out that the system doesn't work as designed should be > considered "bellyaching" any more than telling a webmaster that their server > is down is considered "bellyaching"! Go, hyperbole! Repeatedly complaining about anything that works, but is too complex to work as designed, is bellyaching. DP is not down; it does work. It in fact works a heck of a lot better than originally designed. -- Kie ekzistas vivo, ekzistas espero. From hart at pglaf.org Thu Mar 4 19:36:57 2010 From: hart at pglaf.org (Michael S. Hart) Date: Thu, 4 Mar 2010 19:36:57 -0800 (PST) Subject: [gutvol-d] Was FOSS: Heinlein's Razor In-Reply-To: <4B903F00.3050704@novomail.net> References: <30a2.40c90764.38bf8822@aol.com> <201003032121.50581.donovan@abs.net> <4B90028E.4020103@novomail.net> <201003041541.18286.donovan@abs.net> <4B903F00.3050704@novomail.net> Message-ID: Robert A. Heinlein said it 1941. Not original with Robert J. Hanlon [many suspect error from Robert Heinlein to Robert Hanlon] Napoleon might have said something like it. From vze3rknp at verizon.net Fri Mar 5 07:16:56 2010 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Fri, 05 Mar 2010 10:16:56 -0500 Subject: [gutvol-d] Re: Preservation in the big scanning projects In-Reply-To: References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> Message-ID: <4B912068.6060003@verizon.net> On 3/3/2010 1:24 PM, Jim Adcock wrote: > > The question is, in my mind, is Google preserving the books, and doing so > for the public good or not? I suspect when Google digitizes the book the > original is then trashed by the college library -- the whole point being > they do not want to have to pay to maintain physical library books in > various states of decay. Google then becomes the sole repository for this > information -- excepting a smallish number of copies at TIA. This is absolutely not true. First of all, part of every agreement between a library and Google is that the library gets a copy of all the scans that Google makes. Depending on the exact contract, there may or may not be some restrictions on what the library can do with the scans, but they definitely get them. Further, the libraries do not get rid of the books. In fact, they are very protective of their books, which is why a face-up, human controlled scanning method is used (thus resulting in the occasional hand or finger in the scan). All books are returned to the libraries with as little wear as possible. For logistical reasons, both Google and the Internet Archive started with books that were in off-site repositories, but those repositories are not being removed. The librarians in charge of the scanning projects all understand that what Google is providing is a search tool, not preservation. The Internet Archive is much closer to doing archival quality work, but the libraries are still keeping the books. Remember, these librarians were burned by the promise of microfilm and microfiche as more compact storage formats for periodicals and such. A bunch of major libraries have put together a consortium called the Hathi Trust which has the explicit purpose of making sure that book scans are not lost. It provides off-site, secure storage for what the participant libraries want to put there. This includes the libraries' copies of the Google scans, as well as whatever else they decide to include. The last I was aware, the Hathi Trust did not do much, if anything, to provide public access to those scans, since that is not its purpose. I mention it here only to make folks aware that the libraries are making provision for storage even if places like Google, the Internet Archive, or, indeed, one of their own members, should disappear. I now return you to your arguments about DP. Juliet Sutherland From vze3rknp at verizon.net Fri Mar 5 07:52:39 2010 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Fri, 05 Mar 2010 10:52:39 -0500 Subject: [gutvol-d] Re: DP/PG vs. Google In-Reply-To: References: <3103a.2b657b5c.38c03a05@aol.com> Message-ID: <4B9128C7.8040504@verizon.net> On 3/3/2010 6:03 PM, James Adcock wrote: > > >do you want to bet against google? > > >because i'll take that bet against you. > > Sure, I'd be happy to take that bet, if I am allowed to win it or lose > it in a finite amount of time -- such as a decade. What I think is > much more likely in a decade is that Google is either gives up or they > figure out how to post much more attractive page images. I actually > don't think they have much of any interest in posting higher quality > automatic OCR transcriptions. > Wrong again. Google is funding development of open source OCR software via project called ocropus. I believe a beta version is due out shortly. Further, Google bought ReCaptcha. That's the company and software that make you prove you are human on many websites. They provide two scanned words, one known and one not. The human types in both. This works well because what is hard for OCR software, eg a computer, is often easy for a human. Over millions of comparisons they are able to build up a pretty good version of the text. Since they don't address punctuation, and because capital and non-capital letters, and some blobs, can be hard to recognize out of context, they won't get the text perfect. But they can turn something from total gibberish into readable text. I believe that there will always be a place for humans in preparing etext versions of some books. But, just as OCR eventually became good enough to start with, eventually technology will improve enough humans will add value only on very difficult texts, or by contributing semantic information. I don't know when that will happen, but it is certainly coming. Juliet Sutherland -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Mar 5 09:31:57 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 5 Mar 2010 12:31:57 EST Subject: [gutvol-d] roundlessness -- 011 Message-ID: <14b8f.f185aa8.38c29a0d@aol.com> ack! rfrank has started "archiving" the books from his roundless site... it sounds like he'll be deleting the scans when a book posts to p.g., even though, so far, he has _not_ posted those scans to p.g. as well. please won't somebody tell him p.g. will mount those files for him? (and there are lots of i.s.p. who offer huge amounts of storage and bandwidth nowadays at a very cheap price, like dreamhost; there is absolutely no need to delete files for "space" reasons.) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From hart at pglaf.org Fri Mar 5 18:07:45 2010 From: hart at pglaf.org (Michael S. Hart) Date: Fri, 5 Mar 2010 18:07:45 -0800 (PST) Subject: [gutvol-d] Re: Preservation in the big scanning projects In-Reply-To: <4B912068.6060003@verizon.net> References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net> <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com> <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com> <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com> <4B912068.6060003@verizon.net> Message-ID: If you actually visit the library archives working with Google, you should be able to find out that what was promised is not an entirely true case when it comes to reality. . .at least in POV of the librarians who will speak to you freely. Of course, I will also be the first to admit that you can get a number of librarians from the same institution who will say all is perfectly well. But it's not perfect. . .not down at the lower level realities, not where the rubber meets the road. I do note that the ones who say all is well and dandy are those with political and academic aspirations, and those who tell you things are not what they should be are more street level. We have plenty of both here at the University of Illinois. ;=) On Fri, 5 Mar 2010, Juliet Sutherland wrote: > > > On 3/3/2010 1:24 PM, Jim Adcock wrote: > > > > The question is, in my mind, is Google preserving the books, and doing so > > for the public good or not? I suspect when Google digitizes the book the > > original is then trashed by the college library -- the whole point being > > they do not want to have to pay to maintain physical library books in > > various states of decay. Google then becomes the sole repository for this > > information -- excepting a smallish number of copies at TIA. > This is absolutely not true. First of all, part of every agreement between a > library and Google is that the library gets a copy of all the scans that > Google makes. Depending on the exact contract, there may or may not be some > restrictions on what the library can do with the scans, but they definitely > get them. > > Further, the libraries do not get rid of the books. In fact, they are very > protective of their books, which is why a face-up, human controlled scanning > method is used (thus resulting in the occasional hand or finger in the scan). > All books are returned to the libraries with as little wear as possible. For > logistical reasons, both Google and the Internet Archive started with books > that were in off-site repositories, but those repositories are not being > removed. The librarians in charge of the scanning projects all understand that > what Google is providing is a search tool, not preservation. The Internet > Archive is much closer to doing archival quality work, but the libraries are > still keeping the books. Remember, these librarians were burned by the promise > of microfilm and microfiche as more compact storage formats for periodicals > and such. > > A bunch of major libraries have put together a consortium called the Hathi > Trust which has the explicit purpose of making sure that book scans are not > lost. It provides off-site, secure storage for what the participant libraries > want to put there. This includes the libraries' copies of the Google scans, as > well as whatever else they decide to include. The last I was aware, the Hathi > Trust did not do much, if anything, to provide public access to those scans, > since that is not its purpose. I mention it here only to make folks aware that > the libraries are making provision for storage even if places like Google, the > Internet Archive, or, indeed, one of their own members, should disappear. > > I now return you to your arguments about DP. > > Juliet Sutherland > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From hart at pglaf.org Fri Mar 5 18:08:40 2010 From: hart at pglaf.org (Michael S. Hart) Date: Fri, 5 Mar 2010 18:08:40 -0800 (PST) Subject: [gutvol-d] Re: roundlessness -- 011 In-Reply-To: <14b8f.f185aa8.38c29a0d@aol.com> References: <14b8f.f185aa8.38c29a0d@aol.com> Message-ID: Just to make it "official". . .we will save all scans sent. On Fri, 5 Mar 2010, Bowerbird at aol.com wrote: > ack! > > rfrank has started "archiving" the books from his roundless site... > it sounds like he'll be deleting the scans when a book posts to p.g., > even though, so far, he has _not_ posted those scans to p.g. as well. > > please won't somebody tell him p.g. will mount those files for him? > > (and there are lots of i.s.p. who offer huge amounts of storage > and bandwidth nowadays at a very cheap price, like dreamhost; > there is absolutely no need to delete files for "space" reasons.) > > -bowerbird > > From Bowerbird at aol.com Sat Mar 6 12:07:29 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 6 Mar 2010 15:07:29 EST Subject: [gutvol-d] Re: roundlessness -- 011 Message-ID: <58190.5b1478bd.38c41001@aol.com> michael said: > Just to make it "official". . .we will save all scans sent. roger doesn't appear to be interested in submitting scans yet... (d.p. reluctance on this perverts all who come in contact with it.) would you mount scans for his books that _i_ sent in? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajhaines at shaw.ca Sat Mar 6 13:29:15 2010 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sat, 6 Mar 2010 13:29:15 -0800 Subject: [gutvol-d] Re: roundlessness -- 011 References: <58190.5b1478bd.38c41001@aol.com> Message-ID: <2335A1A5E23445C2AD8B4E25E21FB2B7@alp2400> Seems to me that if Roger has files of any kind on his personal (or personally paid for) server, those files are his, to do with as he wishes. Their content is irrelevant. If he chooses to submit them to PG, fine; if not, that's his choice. IMO - what bowerbird is proposing is outright theft. His apparent "holier-than-thou" attitude doesn't make up for that. Speaking both personally and as a Whitewasher, I wouldn't touch such files with the proverbial ten-foot pole. Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; bowerbird at aol.com Sent: Saturday, March 06, 2010 12:07 PM Subject: [gutvol-d] Re: roundlessness -- 011 michael said: > Just to make it "official". . .we will save all scans sent. roger doesn't appear to be interested in submitting scans yet... (d.p. reluctance on this perverts all who come in contact with it.) would you mount scans for his books that _i_ sent in? -bowerbird ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sun Mar 7 12:26:26 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 7 Mar 2010 15:26:26 EST Subject: [gutvol-d] al and his 10-foot pole Message-ID: <89722.3405ffe2.38c565f2@aol.com> al said: > IMO - what bowerbird is proposing is outright theft. i guess it's silly season here on the project gutenberg listserve. people are quite casual lately tossing about the "t" word ("theft!"). won't somebody please come train the firehose on this attack-dog so i don't have to? some elementary public-domain f.a.q. will do. in the meantime, as i have "confessed" before, i've mounted some of roger's scansets on my own site, so if roger has a problem with me doing that, he should send me a cease-and-desist. (kidding, of course, which should be obvious since i hate lawyers so much; roger can just send an e-mail, and we can discuss it all friendly.) > His apparent "holier-than-thou" attitude doesn't make up for that. well, at least when i accuse someone of a moral shortcoming, i provide _evidence_, instead of just making some bogus claim. > Speaking both personally and as a Whitewasher, > I wouldn't touch such files with the proverbial ten-foot pole. gee, al, we certainly wouldn't want you, either personally or as a capital-w whitewasher, to get involved with _criminal_activity._ one day you're letting your parking meter run out prematurely, and the next day you're dealing in public-domain scansets, and the next day the russian mafia has got your sorry ass in a sling. can't be too careful. good thing you have a 10-foot pole handy. *** donovan said: > experience has shown that any discussion of DP on gutvol-d > generally and unfortunately serves little productive purpose. > While positive and insightful comments do occur, > (and are read and appreciated!), they are easily lost > in the background of posts which far too often contain > derision, belittlement, accusation, and misrepresentation. and again, i wish i could just toss out a phalanx of terms like "derision, belittlement, accusation, and misrepresentation" with a wave of the hand. but when i say something like that, i feel a very strong need to back the charges with _evidence_. because when the derision (and the belittlement) is _deserved_, then the "accusations" are not a "misrepresentation", but indeed a clear-cut indictment. beware the person who doesn't want to discuss the charges, but merely have them dismissed as a "misrepresentation"... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sun Mar 7 14:42:43 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 7 Mar 2010 17:42:43 EST Subject: [gutvol-d] the best proofers Message-ID: <8dea8.69848de4.38c585e3@aol.com> here are some of the findings from my analyses of the various "experiments" done over at d.p. the best proofers miss between 5% and 25% of the errors over the course of proofing an entire book. the worst proofers miss an even higher percentage, but it is not all that much higher, probably 10-40%. there is no evidence for the position that proofers "get bored" and therefore miss a higher percentage if the text they proof is clean (i.e., has few errors)... p3 proofers are no better than p2 or p1 proofers. some errors withstood over 5 rounds of proofing; there was nothing obviously "difficult" about them. the best predictor of whether a page is now "clean" is how many people proof it without finding an error. if the last person to proof a page found an error, then you cannot reliably predict it to be error-free, no matter how confident the proofer believes that... if anyone wants to dispute or discuss these findings, i'd be open, and will ask about your supportive data. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From lee at novomail.net Sun Mar 7 15:45:50 2010 From: lee at novomail.net (Lee Passey) Date: Sun, 07 Mar 2010 16:45:50 -0700 Subject: [gutvol-d] Re: roundlessness -- 011 In-Reply-To: <2335A1A5E23445C2AD8B4E25E21FB2B7@alp2400> References: <58190.5b1478bd.38c41001@aol.com> <2335A1A5E23445C2AD8B4E25E21FB2B7@alp2400> Message-ID: <4B943AAE.6070701@novomail.net> > His apparent "holier-than-thou" attitude doesn't make up for that. said the pot to the kettle... From schultzk at uni-trier.de Mon Mar 8 01:40:28 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Mon, 8 Mar 2010 10:40:28 +0100 Subject: [gutvol-d] Re: al and his 10-foot pole (parts OT) In-Reply-To: <89722.3405ffe2.38c565f2@aol.com> References: <89722.3405ffe2.38c565f2@aol.com> Message-ID: <49F9B423-6D68-4E3D-B64C-10CEA67DD6DE@uni-trier.de> HI All, Theft and copyright infringement are interresting things in the internet world. In one sense anything on the net is up for grabs. That is anybody can download it. That is not theft. it is part of the web. On the other side if you use that copy on the web you need permission unless, already given per se. An interresting case is where you just use the URLs to access the entiity. You are effectively citing it!! You have given reference to the source. You are not using a copy. A sad development here in Germany is that it is now considered ownership of child pornography once it is loaded into main memeory! Please do not get me wrong I am against child pronograohy. But, visting a web-site thereby constitutes onwnership. What a bag of bad worms. Then again anything I load from the web I own, to use as I please privately!!! Cool ? !! regards Keith. Am 07.03.2010 um 21:26 schrieb Bowerbird at aol.com: > al said: > > IMO - what bowerbird is proposing is outright theft. > > i guess it's silly season here on the project gutenberg listserve. > people are quite casual lately tossing about the "t" word ("theft!"). > > won't somebody please come train the firehose on this attack-dog > so i don't have to? some elementary public-domain f.a.q. will do. > > in the meantime, as i have "confessed" before, i've mounted some > of roger's scansets on my own site, so if roger has a problem with > me doing that, he should send me a cease-and-desist. (kidding, > of course, which should be obvious since i hate lawyers so much; > roger can just send an e-mail, and we can discuss it all friendly.) -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Mar 8 10:25:49 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 8 Mar 2010 13:25:49 EST Subject: [gutvol-d] taking stock on march 8th Message-ID: <20efd.68f1c528.38c69b2d@aol.com> march 8th is international women's day. rock on, estrogen! :+) *** congratulations to d.p. on the posting of their 17,000th e-text! the volunteers who digitize these books are _awesome_people_! *** speaking of women, and d.p. volunteers, juliet sutherland is the volunteer who has donated d.p. the most time and energy, so a special shout-out to her on this day. in a recent post to the d.p. forums, juliet admits to being frustrated by many of the "volunteers versus admins" threads on those d.p. forums. she also notes that her frustration is multiplied whenever she considers that the site does have many problems and that she, as the former top dog, bears a good deal of the responsibility. i've criticized juliet a lot, because -- frankly -- she deserved it. but it's not fun to see that anyone is frustrated, ready to quit... so i'd urge juliet to hang in there... juliet also said she often finds dkretz comments "off the wall". here's a quick note, juliet: he's usually right, and you're wrong. so try to see it from don's perspective as you hang in there... *** i'm in the middle of lots of different threads here, so i'll try and work on finishing them up this week. in no particular order... *** i want to finish my work on gardner's e-text, showing how it can now be auto-converted into various formats, as a way to demonstrate my version of "postprocessing", which seems to be far more direct than the d.p. version. *** i have some messages to post in response to carel... i've started notes about creating a proofing system, for carel and anyone else contemplating doing that. *** i have a pair of posts on the good and bad aspects of roger's roundlessness experiment... i will be finishing up the handful of books that i have taken from roger's site, comparing 'em with his output. *** i'm also going to show you a little perl script i coded which summons up the various pieces of each page of a book on roger's roundless site, and stitches 'em on one webpage, and also lets you "thumb through" the pages of the books on his site, to check them... here's the first draft of that script: > http://z-m-l.com/go/showbarebones.pl it pulls in the scan and text for each page, obviously, as well as the tweet (if any) and the log file information. under roger's current system per se, it's not convenient to do this on page after page. (although a person _can_ access each of the pieces of information independently.) but really, there's no reason why it shouldn't be easy to "walk through" the pages of a book as it's being worked. the script works for my cinema screen, but i'm not sure how well it will fit on smaller screens. but i'm unlikely to improve it in that regard, since i believe it's unlikely anyone will spend much time actually using the thing... (not that it's not useful; everything i program is useful; it's just that few people here actually use what i create.) i wrote it just to get myself back into some perl coding. notice that i _am_ willing to smooth out this code _if_ anyone really wants to use it, but -- other than that -- i'm just gonna add a few refinements to it and it's done. *** there you go... plenty of meat for this week... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Mar 8 14:01:17 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 8 Mar 2010 17:01:17 EST Subject: [gutvol-d] roundlessness -- 012 Message-ID: <335f9.5e85a104.38c6cdad@aol.com> rfrank's roundless experiment is proving to be _very_ interesting... and, as you might expect, there is good news and there is bad news. let's talk about the good news here in post #12, the bad in post #13. *** first of all, rfrank is showing that it's not all that difficult to set up a proofing site. in just 2 months, he's put together a critical mass, and that's quite an achievement. if he shares his code with others, they'll be able to move even faster. (if he doesn't, i've got a little code that'll do the trick for people who want a bit of a head-start.) it's another matter to pull workers to the site, of course. however, if project gutenberg chose to steer people to these _other_ sites, instead of funneling all the volunteers to distributed proofreaders (who -- truth be told -- don't even _want_ new people nowadays), it wouldn't be hard at all for these sites to get enough volunteers. but even with his low numbers of volunteers, what rfrank is doing is _head_and_shoulders_ more interesting than anything d.p. is doing. his site is dynamic, while d.p. has been too moribund for too long... *** in the last week, rfrank installed a spellcheck capability to his site. after a mere 2 months. d.p. went about 5 or 6 _years_ without it. *** moreover, when d.p. finally got a programmer to code spellcheck, the process was plagued by a forum discussion that ran 30 pages. at 15 messages per page, that's 450 messages. and most of 'em were from people who didn't know what they were talking about, and thus just added a buncha noise and confusion to the process. which is why it's probably not surprising that it was coded wrong. well, "wrong" is perhaps a bit strong. but the decision was made to do spellcheck using "aspell", because "it's open-source code". which would be fine, if you needed a full-fledged spellcheck... but that's not what a proofing site needs, because the object is _not_ to have another word "suggested" (which is the hard part about coding spellcheck), but merely to _flag_suspicious_words_ (a ridiculously easy task consisting of searching a dictionary to ascertain whether the word you're checking is included therein) so that all the suspicious words can be compared to the scan... i'm guessing rfrank did his spellcheck the simple way. *** rfrank also installed a capacity for a "good" and "bad" wordlist, necessary since that customizes the dictionary for each book, and -- like d.p. -- lets the proofers suggest words to include. unlike d.p., however, under rfrank's system, whenever a person "suggests" a word, it's _automatically_ included _immediately_. at d.p., a suggestion must be considered by a superior, who might or might not agree, and might or might not be timely. this is a signal of the disgust with which d.p. treats proofers. it also means that rfrank's system throws far fewer false flags, which means it provides much greater value to the proofers... i worked very hard, in the confines of that 30-page thread, to have d.p. give the proofers an automatic capability to add words to the good and bad lists, but they just wouldn't do it. rfrank did. good for rfrank. he's smarter than the d.p. crowd. *** rfrank has also included reg-ex checks, and scanno checks, so his list of helpful tools is already very impressive, 2 months in. *** rfrank has also shown he's willing to do global changes to text, which is one of those things that d.p. has been unwilling to do, in spite of the fact that i've pointed out the utility of it for years. d.p. would rather have individual proofers correct every instance of a global error -- one by one by one, painstakingly -- instead of fixing 'em all immediately, with one global change. shameful. *** rfrank also showed considerable independence when he decided he would have his people do proofing and formatting together... it's unclear to me whether the d.p. split between those two tasks is effective or not, but the _religion_ at d.p. is that it has been... so it is quite courageous of rfrank to test that accepted "wisdom". *** rfrank also seems committed to using diffs to train up volunteers. this, of course, is one of the benefits a roundless system offers, so it's natural that he'd take advantage, but it's still a good thing. *** rfrank has given workers a way to make comments _about_ a page without actually putting them _inside_ the text, which is fantastic. (he calls this feature "page tweets".) at d.p., they have a "project thread" in the forums (as does rfrank), but the only way to make comments about a page is to put them _inside_ the text. but of course then someone later down the line has to _remove_ them from there. that's a sign of a bad workflow, when someone later on must undo something that was done earlier. *** in looking at some of the projects, it seems that rfrank has finally started doing more aggressive preprocessing of the o.c.r. itself... for instance, the number of spacey quotes has dropped remarkably. there are still some, but nowhere near the number he had before... since this is an area that i know to be _so_ important, any progress toward enlightenment at all is the sign of a very good development. *** so, all in all, there's lots of positive aspects to rfrank's experiment. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Mar 8 16:46:27 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 8 Mar 2010 19:46:27 EST Subject: [gutvol-d] roundlessness -- 013 Message-ID: <3ed15.48c76ac1.38c6f463@aol.com> ok, i talked about the positive parts of rfrank's roundless experiment. now it's time to review the "bad news" -- the not-so-positive parts... many of the _negative_ parts are quibbles about the implementation of the positive aspects, so i will discuss them at the end of this post... there are, however, a few points that are almost fully negative, still. *** the very first thing i do with a set of files that i start working with is to make sure that they're named _correctly_and_intelligently_... that is, a filename must explain, all by itself, the file's _contents_, and the filename must contain the pagenumber from the p-book. moreover, every file must have a _unique_ filename, and every file associated with the same page should share a similar name (e.g., the name will be the same, but with a different extension.) i've done extensive work with datasets that follow these rules and with datasets that do not, and i can say without any hesitation that the datasets which do not follow these rules are much more clumsy, and waste bits of my time that are small but cumulate to significance. and that's why i now no longer will even work with badly-named files. it's just unnecessary frustration. people who use badly-named files will tell you they have "adapted" to the naming. that's pure and simple crap. they don't know better because they haven't worked long and hard with both kinds of data. they are handicapped, and they just don't know they are handicapped. rfrank names his files incorrectly. maybe someday he will learn... *** rfrank does a lot of things right. his scans are extremely well-done, which indicates that he is very careful and meticulous when scanning. it's quite likely he also does some refinements on the scans, such as straightening them, centering them, and perhaps despeckling them. they look quite nice, and they are generally a pleasure to work with... however, all this care seems to be dropped once he's done the o.c.r. his preprocessing routines used to be abysmal. they're better now, but they still have considerable room for improvement. i'm hopeful that he's learned the lesson. he has pushed many of his checks back, from postprocessing to the proofing stage. so now he just needs to push them back from the proofing stage to the preprocessing stage. it would perhaps be very helpful in this regard if _somebody_ who is working at the fadedpage.com site would _volunteer_ to do the step of nondistributed preprocessing, thus freeing rfrank from doing it... he's probably feeling very overwhelmed at the moment, so an offer like that would probably be something that he would accept readily, and it would make a remarkable difference in evolving his progress. *** we've already discussed recently that rfrank should submit his scans along with his postings to p.g. alas, he's picked up the bad habit of failing to do that from his distributed proofreader upbringing. pity. *** it would also be good if rfrank kept the linebreaks and pagebreaks of the original p-book when he submitted the book to project gutenberg. but hey, that's unlikely, isn't it? what _is_ more likely, however, is that he would keep the linebreaks _consistent_ between the various versions that he submits to p.g. but on one file i checked, the 7-bit version was wrapped differently than the 8-bit version, which was wrapped differently than the .html. this is madness, if/when it comes to doing long-term version-control. *** rfrank also picked up the bad habit from d.p. of "clothing" em-dashes. of all the stupid things d.p. does, this is among the most stupid of all. and yet rfrank, who showed the ability to rethink proofing/formatting and roundlessness per se, failed to grasp the basic stupidity of this... *** ditto with unhyphenating the end-of-line hyphenates. i take it all back about "clothing hyphens" being the most stupid thing... dehyphenating has to be the _most_ stupid, because when the proofers do this, they actually destroy the evidence that a computerized routine would use to do the job _properly_, which is _on_a_book-wide_basis_... again, the failure of rfrank to rethink such an obvious stupidity is sad... (kudos, however, to one of his members, for spelling it out in a post. let's just hope that that reasoning will soak in to rfrank's busy brain.) *** again, repeating a d.p. flaw, rfrank strips runheads and pagenumbers from his o.c.r. and perhaps fate is trying to teach him a lesson on this, because he has had several problems where text on a page was deleted, or replaced with text from some other page. these types of problems can be detected and prevented when each page contains its pagenumber. in general, you want to _retain_ this information because it "earmarks" each page of text, making it clear what book it comes from, and where. it also serves as the "suspenders" in a "belt and suspenders approach" along with the filename, which will contain the very same information, and thus the two make it very easy to crosscheck and confirm each other. the silliness of naming all your scansets "001.png" through "999.png" and expecting their subdirectory name to distinguish them is _stark_... (and it has caused all kinds of grief for people in the past, i assure you.) *** rfrank hasn't really installed any instructions of his own, just letting his members rely on their d.p. training, so he has no policy of his own on ellipses, at least that i've been able to detect. but it would be refreshing if he decided to avoid the merry-go-round of never-ending changes that sometimes happens at d.p., and went _exclusively_ with the 3-dot ellipse. (it's funny, because many of his books don't even seem to _have_ ellipses!) *** rfrank is putting a lot of stock in "c.i.p." except, to confuse _everyone_, to him, "c.i.p." means "confidence in proofer", not "confidence in page", which is how everyone else defined the term, up to this point in time... now, me?, i don't think you can put much stock in "confidence in proofer". even the best proofers miss errors, and they don't know when they miss, so i don't think that you can trust their judgment and get perfect pages. rfrank's big mistake here is that he's not necessarily looking for "perfect", since he sees himself, as the postprocessor, as the last line of correction, and he's willing to take a non-perfect page if he can get it a little faster... even if that's fine for him, i don't think it's a good way to build a system. but even then, i just don't think "confidence in proofer" will actually work. or, to be more accurate, i think it'll work just well enough that rfrank will put lots of energy into it before he finds out it doesn't work well enough. or, worst case scenario, he'll convince himself that it really _is_ working, and other people believe him, and we all end up with non-perfect pages. on the other hand, rfrank has shown in some cases that he _can_ learn from the data, and change his mind on something he held dearly, so... *** ok, now we're down to the implementation quibbles... *** first, i'll repeat that it's sad that rfrank is "archiving" his finished projects. it would help all of us learn more about roundlessness if he left them up. i offer webspace if rfrank needs it. and project gutenberg has offered too. *** it's still "in-process", so i expect that it might improve, but the spellcheck display that rfrank is offering would benefit by retaining the linebreaks, so the search for "unresolved words" on the pagescan was much easier to do. i'd also like to see each unresolved word in _clickable-button_ form, for both the good-word and bad-word lists, so a button-push would do that. (in the current form, a person has to copy-and-paste each of the words.) i must add, though, that the ability to include these words immediately is a _tremendous_improvement_ over the d.p. method, one that shows its value to the proofer right away, and is thus very robust and valuable. empowering the proofers to benefit themselves is a remarkable asset... *** in this regard, an ability for a _proofer_ to execute a global change would be a mind-blowing step, and thus a very brave thing to try... of course, bear in mind that i believe that all global changes should have every occurrence verified, so take the suggestion appropriately. and i believe that any global changes that might be required _should_ be sussed out during the preprocessing, before proofers even see text. but nonetheless, putting such a powerful tool in the hands of proofers would speak _volumes_ on the responsibilities you entrust them with, and thus make a tremendous statement that would _embolden_ them... even if they never ever used it... *** as it is now, though, rfrank does the global changes himself, and he has been a little reluctant to do the job in the way that he really "should"... at least he was in one case -- where he declined to fix a contraction -- but perhaps that was not representative of his feelings more generally, so i'll let it go for now... *** as i said before, i don't know whether the d.p. separation between proofing and formatting is a good thing or not. i see the arguments in favor of it, and they seem compelling to some degree, but i also know that the vast majority of pages have little or no formatting, so i'm reluctant to lay another step on the overall process for no benefit. so, in cases like this where the answer is unclear, i'd do an experiment. luckily, rfrank is doing an experiment. it's not a well-controlled one, and we're not really privy to all of the data, so it's far from being ideal, but at least we're engaged in the active questioning of an unknown... still, it would be nicer if we were doing the experiment _properly_... *** rfrank does a pretty good job of showing proofers their diffs, _except_ that you must visit each project-page to see your diffs for that book... it would be far better if you were presented with all of 'em on one page. (and i would emphasize that page by presenting it to the user _first_, when they return for more proofing, so they'll realize its importance.) there's also the slightly troubling aspect that if you mark a page "done", the odds are lowered that it will be proofed again, so you don't obtain the satisfaction of getting a "no-diff" result on that page. i do believe that's counterproductive, and i'd like to see every page reproofed once, even after the page was marked "done", even by a high-c.i.p. proofer... and, of course, having the page reproofed, and having the "done" status confirmed by a "no-diff" by the subsequent proofer, would also raise the "confidence-in-page" for that page, and thereby serve a double benefit... (conversely, if the next proofer finds an error, they rescued a false done.) *** the "page tweet" idea is a good one. (the astute observer might realize that this is the same idea i always use on the bottom of my web-pages, where a person can leave a comment about that specific p-book page.) however, a way to _consolidate_ the tweets for a book would be useful. (and easy to code.) that way, a proofer could look at all of the "tweets" and perhaps answer some of the questions being posed, or fix some of the problems being reported, or take some other kind of positive action (such as finding a person who _can_ fix the problem if you cannot do it). also, it would be good if there were some dedicated buttons on the page, so it would be easy to say things like "difficult formatting, please check" or "foreign language specialist needed on this page", or stuff like that... again, that way a person perusing all of the tweets for a book will know exactly what needs to be done among this list of possible specific tasks, *** and -- just to finish up this post by taking it back to the beginning -- i note with amusement that rfrank uses the term "page" throughout his system. he lists the "pages" that you've done, and calls the notes you attach "page tweets", and the diffs are listed by "page", and so on. so it is ironic that when he talks about "page 123", he's not _really_ talking about _page_123_ at all! he's really talking about _.png_ 123! and the file named "123.png" probably isn't about page 123 at all! indeed, let's review the 7 files named "123.png" rfrank has up now: > http://fadedpage.com/p/201002140505/d/123.png > http://fadedpage.com/p/201002140533/d/123.png > http://fadedpage.com/p/201002270757/d/123.png > http://fadedpage.com/p/201002280257/d/123.png > http://fadedpage.com/p/201003020840/d/123.png > http://fadedpage.com/p/201003040537/d/123.png > http://fadedpage.com/p/201003070309/d/123.png what we actually find are pages 120, 112, 118, 106, 82, 124!, and 118, respectively. that's quite a range of pages, but alas, none are page 123. so any reference to a "page" number on the faded.com website is gonna frustrate anyone who wants to know what _page_ was being talked about, once rfrank has gone and deleted all of those files. which is a real pity... but alas, here i am talking about filenaming conventions again. help me! time to draw this to a close... *** while i'm letting myself discuss the negatives without feeling any guilt, i might add that it'd be nice if rfrank shared data from his experiments. of particular interest are all the intermediate files, such as the various pages as saved by individuals, and the concatenated text file at various "checkpoints" along the way, notably before and after postprocessing... without such data, we really have no way of evaluating the experiment! rfrank comes from a world of engineers working in private companies, where data is closely guarded, and he doesn't seem to have the attitude that is prevalent in the scientific world that data belongs to the public, and that sunshine is the best disinfectant, and open data is a positive... in this regard, i read and liked this article: > http://flowingdata.com/2010/03/04/think-like-a-statistician-without-the-math/ i sure could learn a lot more if rfrank were open with sharing his data, and my guess is that lots of other people could learn lots more as well. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Tue Mar 9 13:01:50 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 9 Mar 2010 16:01:50 EST Subject: [gutvol-d] the comparison method does indeed work, and work well Message-ID: a little while back, i pointed out that a person can compare two independent digitizations to find errors in both of them, and that this method works very well. carel said: > That depends on a lot of factors including the assumption > that two OCR programs would not make the same mistake that's a good point. if the two digitizations have errors in common, then the comparison method won't be able to find them, and thus its effectiveness will be lessened somewhat. there's no argument with that. what's surprising to me, however, is how many people are completely defeated by this _possible_ shortcoming. upon learning that there _might_ be a problem with the comparison method, they dismiss it with no other thought. not me. i set out to actually _test_ the assumption. i documented the results in a thread in the d.p. forums. you can search for "revolutionary o.c.r. proofing". it's at: > http://www.pgdp.net/phpBB2/viewtopic.php?t=24008 as i note there, i presented the data earlier elsewhere, at: > http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005& post=2005-10-03,3 so yeah, that's right, my findings are over 4 years old now. *** what my research found is that there were virtually _no_ errors-found-in-common between the two digitizations. and this finding was replicated, and replicated once again. in other words, the effectiveness of the comparison method is _not_ lessened by this possible shortcoming. no indeed, the evidence says it is not even affected in the slightest way. the clarity of the results was striking; they are unforgettable. if you doubt the data, i encourage you to repeat the research. because repeating the possible problem, without any data, won't get anyone very far in the future, not if i'm listening... *** here's a quick-and-dirty experiment, for anyone willing... i just used the comparison method on gardner's e-text, and found 159 differences between his work and mine... i then resolved the differences by consulting the scans... 79 differences were due to errors in his work. 77 were due to errors in mine. 3 were due to errors in _both_ his and mine. now, of course, any errors-in-common will still reside in both his and mine. why don't you see if you can find any? > http://z-m-l.com/go/gardn/gardn.zml > http://z-m-l.com/go/gardn/gardnp123.html i'll be waiting. but i won't be holding my breath... *** carel said: > I feel that a human looking at > a smaller subset of a large document > is a good thing in the error finding process. > You apparently do not think it is. if the comparison method has already found all the errors, why waste the time and energy of a human rechecking that? > Neither of us is right or wrong: > It is a matter of perspective and opinion. unless i get a good answer to the question that i just asked, my opinion will continue to be that i am _absolutely_right_, and you're wrong because you're wasting human resources. that's my perspective, and i'm not changing it... :+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Tue Mar 9 15:46:35 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 9 Mar 2010 18:46:35 EST Subject: [gutvol-d] Re: roundlessness -- 013 Message-ID: <379bb.624c0680.38c837db@aol.com> i said: > even then, i just don't think "confidence in proofer" will actually work. > > or, to be more accurate, i think it'll work just well enough that rfrank will > put lots of energy into it before he finds out it doesn't work well enough. > > or, worst case scenario, he'll convince himself that it really _is_ working, > and other people believe him, and we all end up with non-perfect pages. good thing i posted that yesterday. because today rfrank posted his first "informal analysis". and it looks like i was right... rfrank did his analysis on 32 pages that were marked as "done" but then subsequently proofed again, as is done for a random sample... he admits this is a small number of pages, and that there are also "many factors at play", but then goes on to draw conclusions anyway. of the 32 pages, two had added proofer notes, and 1 error was fixed. he doesn't tell us if either (or both) of the proofer notes were good, in the sense that they pointed out something of value, so we'll have to assume that they were meaningless and just added noise to the text. but even then, we have 1 error missed in 32 pages. on the face of it, that means that 3% of the "done" pages had an error. so, for a 200-page book, that would cumulate to a total of 6 mistakes. again, by my 1-error-every-10-pages criterion, that's fully acceptable. but by the (unrealistic) standards of _most_ of the volunteers, it's not. rfrank concludes that "this seems to say that making sure every page is seen by two proofers is not warranted"... so that's his take on this. *** partly the decision rests on the abundance of proofers. if you have lots and lots of proofers, like d.p., then you can afford to send a page through them 2 times or 3 times, even 4 or 5 times. but if your proofers are scarce, like they are over at fadedpage.com, then you might be reluctant to have them view a page even twice... i think i'm pretty good about making sure proofers are used _wisely_. i don't think i abuse their contribution, or that i take 'em for granted; neither am i afraid to use their resources if it is responsible to do so. and i think having 2 people verify a page as clean is responsible use. *** the other thing, though, in evaluating all these experiments, is that you need to know how many errors there _really_ were on each page. only _then_ can you accurately access the accuracy of the proofers... remember that there are lots of pages in these books that have _no_ errors on them, none at all. is it any surprise, then, that they were _actually_ "done" when they were _marked_ as "done"? not hardly... likewise, it isn't really a surprise when a page with _one_ error on it has that error fixed, is then marked as "done", and is _really_ done. what you have to pay attention to, in such cases, are the pages where an error is _not_ found by the first person, who marks it "done", but is then found by the second person. rfrank isn't making nearly enough information available that we can analyze the results in a reasonable way. so i guess we just have to "trust" him. i just wish i had more faith in his reasoning. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Mar 10 15:52:02 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 10 Mar 2010 18:52:02 EST Subject: [gutvol-d] any arguments against "free-range" proofing? Message-ID: <36c3a.27b91b5.38c98aa2@aol.com> the d.p. proofing system locks each page to a single proofer. (there's one and only one p1 proofer, p2 proofer, and so on.) so does rfrank's roundless system; once a page has been assigned to a proofer, it's semi-difficult to even look at it. and if someone else has reproofed it _after_ that person, then the old version is stored somewhere i can't figure out, so tracking the diffs simply cannot be done by an outsider. (the d.p. system at least allows you to do that tracking, and even has a routine that will show you round-to-round diffs.) it is by analyzing these round-to-round diffs very closely that you can get a sense for how a page progresses from the initial o.c.r. to its final -- hopefully perfect -- stage... *** the question i have today is whether there is a good reason why a page needs to be assigned-and-locked to one person. is there any reason why you shouldn't allow any proofer to go and proof any page in a book? yes, it would mean that some pages might be proofed several times, but so what? that's not necessarily a _bad_ thing, is it? i'm writing code now to build my own proofing system, and i'm curious about this particular aspect. i think it would be important to inform a proofer how many previous people have proofed each specific page, so as to let that proofer choose whether to do an additional proof, but if they _want_ to do it, is there any reason to disallow it? *** partly this ties into _incentives_... most people like _finding_and_fixing_ errors, so there'll be a good incentive for people to work in the "first" proofing... but even in that first proofing, there are a lot of pages that are _already_ perfect, so there are no errors to find or fix... and in the second and third proofings, the number of errors that are left will be small, even collected over a whole book. so i feel it's very important to reward people for _certifying_ a page -- i.e., confirming that the page is indeed error-free. if i was to put this in terms of a "point" system, it'd be this: > 5 points for fixing all of the remaining errors on a page. > 4 points for doing the first "certification" of a clean page. > 3 point for doing the second "certification" of a page. > 2 point for doing the third "certification" of a page. > 1 point for fixing _some_ (but not all) errors on a page. if you certify a page clean, and someone later finds an error, the points turn _negative_. so make sure of your certification! if you gather enough points, you win _a_million_dollars_! ;+) *** there are a few things you need to stipulate for such a system: 1. there is one -- and only one -- "correct" way to do a page. 2. which means there are no ambiguous guidelines in place. 3. and whitespace is significant. 4. which means there are _no_ "insignificant" diffs. 5. all diffs are reviewed, and can be challenged for correctness. 6. so when a page comes out of proofing, that page is _done_. 7. which means "postprocessing" is a largely automatic thing. *** you can discuss any aspect of this post, but what i'm seeking are any arguments people can think of _against_ free-range proofing. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Thu Mar 11 00:01:55 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Thu, 11 Mar 2010 09:01:55 +0100 Subject: [gutvol-d] Re: any arguments against "free-range" proofing? In-Reply-To: <36c3a.27b91b5.38c98aa2@aol.com> References: <36c3a.27b91b5.38c98aa2@aol.com> Message-ID: Hi BB, I do not see anything truely speaking against such a system. The only problems are the administrative tasks involved. 1) you have to track all this. 2) keep everything store somewhere 3) keep everything in sync The other question that comes to mind is you will need an authority/ies that finally certify that a page satisfies your criteria as being done. Some may call it a administrative nightmare, but it should be workable. regards Keith. Am 11.03.2010 um 00:52 schrieb Bowerbird at aol.com: [snip, snip] > it is by analyzing these round-to-round diffs very closely > that you can get a sense for how a page progresses from > the initial o.c.r. to its final -- hopefully perfect -- stage... > > *** > > the question i have today is whether there is a good reason > why a page needs to be assigned-and-locked to one person. > > is there any reason why you shouldn't allow any proofer to > go and proof any page in a book? yes, it would mean that > some pages might be proofed several times, but so what? > that's not necessarily a _bad_ thing, is it? > > i'm writing code now to build my own proofing system, and > i'm curious about this particular aspect. > > i think it would be important to inform a proofer how many > previous people have proofed each specific page, so as to > let that proofer choose whether to do an additional proof, > but if they _want_ to do it, is there any reason to disallow it? > > *** > > partly this ties into _incentives_... > > most people like _finding_and_fixing_ errors, so there'll be > a good incentive for people to work in the "first" proofing... > > but even in that first proofing, there are a lot of pages that > are _already_ perfect, so there are no errors to find or fix... > > and in the second and third proofings, the number of errors > that are left will be small, even collected over a whole book. > > so i feel it's very important to reward people for _certifying_ > a page -- i.e., confirming that the page is indeed error-free. > > if i was to put this in terms of a "point" system, it'd be this: > > > 5 points for fixing all of the remaining errors on a page. > > 4 points for doing the first "certification" of a clean page. > > 3 point for doing the second "certification" of a page. > > 2 point for doing the third "certification" of a page. > > 1 point for fixing _some_ (but not all) errors on a page. > > if you certify a page clean, and someone later finds an error, > the points turn _negative_. so make sure of your certification! > > if you gather enough points, you win _a_million_dollars_! ;+) > > *** > > there are a few things you need to stipulate for such a system: > > 1. there is one -- and only one -- "correct" way to do a page. > 2. which means there are no ambiguous guidelines in place. > 3. and whitespace is significant. > 4. which means there are _no_ "insignificant" diffs. > 5. all diffs are reviewed, and can be challenged for correctness. > 6. so when a page comes out of proofing, that page is _done_. > 7. which means "postprocessing" is a largely automatic thing. > > *** > > you can discuss any aspect of this post, but what i'm seeking are > any arguments people can think of _against_ free-range proofing. > > -bowerbird > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Thu Mar 11 10:51:18 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 11 Mar 2010 13:51:18 EST Subject: [gutvol-d] a question for dkretz about twisted Message-ID: <6d556.783644ab.38ca95a6@aol.com> ok, so i'm coding a proofing system. and wow, i'm impressed with myself and how far i've gotten in just 2 days. i've got a solid engine going already... in programming, the saying goes that the first 90% of a project takes 90% of the time, and the remaining 10% takes the other 90% of the time. and it's true. but still, to have a solid engine after just 2 days means i think i can have a pretty smooth system in 2 weeks... but before i go reinvent the wheel... a question for dkretz on "twisted"... it was coded in "air", so in _theory_ anyway, it will run on a web-server. so, don, can you make it do that? is there any place where it _is_ running on a web-server now? if someone (like me) wanted to run it on their server, would you make the app available to them? i'd love to see it run in a browser. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Thu Mar 11 13:39:37 2010 From: dakretz at gmail.com (don kretz) Date: Thu, 11 Mar 2010 13:39:37 -0800 Subject: [gutvol-d] Re: a question for dkretz about twisted In-Reply-To: <6d556.783644ab.38ca95a6@aol.com> References: <6d556.783644ab.38ca95a6@aol.com> Message-ID: <627d59b81003111339y2604c47fmd4261ea72894bb14@mail.gmail.com> Close, but I don't think close enough. After working with Adobe/Actionscript/Flex/AIR for a while, I appreciate what Steve Jobs said recently when he was asked why Apple doesn't want to work closely with them. He said they were lazy. I think he meant that they have had an unchallenged franchise (with PDF and Flash) for so long that everything is just "good enough" and will be really ready in the next version. What I'd recommend you consider is building on the WordPress blog engine. Almost every variation of user input technique gets implemented early and often because text input is such a core requirement. You get built-in user validation, text-versioning, etc and the free support community is huge. On Thu, Mar 11, 2010 at 10:51 AM, wrote: > ok, so i'm coding a proofing system. > > and wow, i'm impressed with myself > and how far i've gotten in just 2 days. > i've got a solid engine going already... > > in programming, the saying goes that > the first 90% of a project takes 90% of > the time, and the remaining 10% takes > the other 90% of the time. and it's true. > > but still, to have a solid engine after > just 2 days means i think i can have > a pretty smooth system in 2 weeks... > > but before i go reinvent the wheel... > > a question for dkretz on "twisted"... > > it was coded in "air", so in _theory_ > anyway, it will run on a web-server. > > so, don, can you make it do that? > > is there any place where it _is_ > running on a web-server now? > > if someone (like me) wanted to > run it on their server, would you > make the app available to them? > > i'd love to see it run in a browser. > > -bowerbird > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Thu Mar 11 13:43:21 2010 From: dakretz at gmail.com (don kretz) Date: Thu, 11 Mar 2010 13:43:21 -0800 Subject: [gutvol-d] Re: a question for dkretz about twisted In-Reply-To: <627d59b81003111339y2604c47fmd4261ea72894bb14@mail.gmail.com> References: <6d556.783644ab.38ca95a6@aol.com> <627d59b81003111339y2604c47fmd4261ea72894bb14@mail.gmail.com> Message-ID: <627d59b81003111343t30e2d63eq4cf510fced410c2b@mail.gmail.com> Another reasonable alternative might te Google Docs/Google Apps. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Morasch at aol.com Thu Mar 11 15:04:43 2010 From: Morasch at aol.com (Morasch at aol.com) Date: Thu, 11 Mar 2010 18:04:43 EST Subject: [gutvol-d] Re: a question for dkretz about twisted Message-ID: <81cb4.31b86b2e.38cad10b@aol.com> don said: > Close, but I don't think close enough. ok, cool. thank you. just thought i'd ask. i'd be interested in playing with it, though, if you've got it up and available somewhere, or if i can install it on my own site, just to see exactly how "close" it comes... > I appreciate what Steve Jobs said recently i've hated adobe for a long, long, long time... but yeah, at one time, i did respect their work. now it just seems shoddy. bloated and shoddy. > What I'd recommend you consider is > building on the WordPress blog engine. sounds like too much overhead cruft to me... i like to be close to the metal. > Another reasonable alternative > might te Google Docs/Google Apps. sounds like more cruft. i'll stick with perl... thanks again. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Thu Mar 11 15:23:00 2010 From: dakretz at gmail.com (don kretz) Date: Thu, 11 Mar 2010 15:23:00 -0800 Subject: [gutvol-d] Re: a question for dkretz about twisted In-Reply-To: <81cb4.31b86b2e.38cad10b@aol.com> References: <81cb4.31b86b2e.38cad10b@aol.com> Message-ID: <627d59b81003111523s24bfe613lf8987abef5a50d14@mail.gmail.com> If you go to the same site where you download Twister, the source is all there too. http://code.google.com/p/dp50/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Thu Mar 11 15:24:14 2010 From: jimad at msn.com (James Adcock) Date: Thu, 11 Mar 2010 15:24:14 -0800 Subject: [gutvol-d] New Tool "pgdiff" In-Reply-To: <379bb.624c0680.38c837db@aol.com> References: <379bb.624c0680.38c837db@aol.com> Message-ID: I have created a new command line tool "pgdiff" along the lines of what BB has been talking about, which compares two independently OCR'ed texts on a word-by-word basis, so as to find and flag errors. In this regards it is similar to "worddiff", as opposed to "diff" which is the approach BB has been talking about, which compares on a per-line basis. But my new tool has several tricks that haven't been seen before: It can be used with two different versions or editions of the text as long as there are not really long differences in the texts. IE the two texts do not have to have their linebreaks at the same locations. It tries to retain the linebreak locations of the first input text in preference to the second input text. IE the first input text should represent the target text you are trying to create. This means it can also be used for "versioning" - for example using a copy of a PG text from one version or edition of a text to help fix and create a text from a different version or edition of the text. It can also be used to recover linebreak information, where linebreak information has been lost, for example to take an older PG text and recover linebreak information in order to allow, for example, the resubmission of that PG text back to DP for a clean-up pass. In normal mode when if finds an mismatch it outputs the mismatch like this { it'll | it'11 } within the body of the text so that given a regex compatible editor it is very quick to search for and fix the errors found. As BB says, having tried this approach, the manual approach of trying to visually spot errors seems pretty painful and silly. I find that finding differences on a word basis rather than a line basis makes it quicker and easier to fix the errors in general. You do want to do some regex punc normalization on the two OCRs to try to remove the trivial differences prior to running the tool, in order to cut down the number of trivial errors it finds that you have to fix. Source and a compiled windows version at http://www.freekindlebooks.org/Dev/StringMatch It is based on traditional Levenshtein Distances where the token is taken to be the non-white part of a "word" as opposed to measuring distances between lines of text or on individual characters. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Thu Mar 11 15:40:16 2010 From: jimad at msn.com (James Adcock) Date: Thu, 11 Mar 2010 15:40:16 -0800 Subject: [gutvol-d] Re: any arguments against "free-range" proofing? In-Reply-To: References: <36c3a.27b91b5.38c98aa2@aol.com> Message-ID: First, it depends on what you mean by "locked to a particular person." Typical of DB type stuff having two people editing the same record (in this case the same page) at the same time is typically taken to not be a good thing. Assuming you are not suggesting doing away with the typical DB convention of only having one person editing a record (the same page) at a given time, then the remaining problem is "fix thrashing" which we already see happening some in DP land. IE P1 introduces a fix, and then P2 says no *I* think it should be fixed this way and then P3 says no *I* think it should be fixed THIS way. At least in DP land P1, P2, and P3 are different people, so the "fix" may not converge but at least its not thrashing - meaning that there is only three rounds of time-wasting going on. With roundlessness you could potentially run into "proofer wars." Well, actually in DP land you can run into proofer wars too - trust me - its just that the proofers have to run to a "higher authority" to engage in fix thrashing - the DP system doesn't seem to me to directly allow proofer wars to happen. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gbuchana at teksavvy.com Thu Mar 11 15:40:35 2010 From: gbuchana at teksavvy.com (Gardner Buchanan) Date: Thu, 11 Mar 2010 18:40:35 -0500 Subject: [gutvol-d] Re: a question for dkretz about twisted In-Reply-To: <6d556.783644ab.38ca95a6@aol.com> References: <6d556.783644ab.38ca95a6@aol.com> Message-ID: <4B997F73.6040700@teksavvy.com> On 11-Mar-2010 13:51, Bowerbird at aol.com wrote: > > a question for dkretz on "twisted"... > > it was coded in "air", so in _theory_ > anyway, it will run on a web-server. Adobe AIR is a kind of stand-alone container for Flash and Flex. While Flash and Flex are mostly thought of as "web" technologies what is actually going on is that the application is being sent to your browser and executed there. The net is that the whole shebang is essentially a client-side proposition, and an AIR application does not translate easily into a "just a browser" server-hosted application. What could be done is to have a Flex/LiveCycle/Blaze Data Services app on the server that could manage and dish out page images and whatnot, to allow collaborative operation. ============================================================ Gardner Buchanan Ottawa, ON FreeBSD: Where you want to go. Today. From jimad at msn.com Thu Mar 11 16:09:10 2010 From: jimad at msn.com (James Adcock) Date: Thu, 11 Mar 2010 16:09:10 -0800 Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: References: <379bb.624c0680.38c837db@aol.com> Message-ID: PS: To help clarify what I am talking about I enclose below an except of the output of this tool (being used for versioning, error-flagging and linebreak recovery) ===== got to going away so much, too, and locking me in. Once he locked me in and was gone three days. It was dreadful lonesome. I judged he had got { drowned, | drownded, } and I wasn't ever going to get out any more. I was scared. I made up my mind I would fix up some way to leave there. I had tried to get out of that cabin many a time, but I couldn't find no way. There { warn't | wam't } a window to it big enough for a dog to get through. I couldn't get up the chimbly; it was too narrow. The door was thick, solid oak slabs. Pap was pretty careful not to leave a knife or anything in the cabin when he was away; I reckon I had { hunted | himted } the place over as much as a { hundred | himdred } times; well, I was most all the time at it, because it was about the only way to put in the time. But this time I found something at { last ; | last; } I found an old rusty wood-saw { without | v/ithout } any handle; it was laid in between a rafter and the clapboards of the roof. I greased it up and went to work. There was an old horse-blanket nailed against the logs at the far end of the cabin behind the table, to keep the wind from blowing through the chinks and putting the candle out. I got under the table and raised the blanket, and went to work to { saw | sav/ } a section of the big bottom log { out - big | out--big } enough to let me through. Well, it was a good long job, but I was getting { towards | toward } the end of it when I heard pap's gun in the woods. I got rid of the signs of my work, and dropped the blanket and hid my saw, and pretty soon pap come in. ===== One input file has line breaks that look like this: .it. I was all over welts. He got to going away so much, too, and locking me in. Once he locked me in and was gone three days. It was dreadful lonesome. I judged he had got drowned, and I wasn't ever going to get. ===== The other input file has line breaks that look like this: .got to going away so much, too, and locking me in. Once he locked me in and was gone three days. It was dreadful lonesome. I judged he had got. But it doesn't matter, the algorithm will still find the word differences. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Thu Mar 11 17:58:09 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 11 Mar 2010 20:58:09 EST Subject: [gutvol-d] Re: a question for dkretz about twisted Message-ID: <8cfa2.79442125.38caf9b1@aol.com> gardner said: > an AIR application does not translate easily into > a "just a browser" server-hosted application. so, gardner, i think you're telling me that "it can't be done", in regard to running "twister" in a browser. if i'm mistaken, and it can be done -- i.e., _you_ think that _you_ can do it -- do please let me know... thanks. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Thu Mar 11 18:29:47 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 11 Mar 2010 21:29:47 EST Subject: [gutvol-d] [SPAM] re: New Tool "pgdiff" Message-ID: <8ee0f.65afe1b7.38cb011b@aol.com> jim said: > To help clarify what I am talking about > I enclose below an except of the output of this tool personally, i find that output to be obtuse and hard to read. and editing it would be very problematic, and error-ridden. giving people a tool that would present the choices to them, and let them click a button for the correct one, would make this far more easier to work with. *** jim said: > I have created a new command line tool ?pgdiff? good for you, jim! i assume you wrapped the "wdiff" routine in an .exe? that'll make it easier to use for normal windows users. > In this regards it is similar to ?worddiff?, as opposed > to ?diff? which is the approach BB has been > talking about, which compares on a per-line basis. well, i use "diff" as a generic term. whether you use "diff" or "wdiff" depends largely on whether the lines are broken in a similar way, or have been rewrapped. i usually find it's worthwhile to fix the linebreaks so they are identical in the files, and match the p-book. that's because to resolve many of these differences, you have to look at the actual page, and that job is infinitely easier if your linebreaks match the page... > But my new tool has several tricks > that haven?t been seen before: um, ok... > It can be used with two different versions or editions of > the text as long as there are not really long differences ok, but that's something that's been "seen before"... > This means it can also be used for ?versioning? ? > for example using a copy of a PG text from one version > or edition of a text to help fix and create a text > from a different version or edition of the text. i'm not sure i understand what you're talking about here. if there are differences, how do you know if the differences are edition differences or o.c.r. differences? you'd have to refer to the page-scans for one version or the other, right? > It can also be used to recover linebreak information, > where linebreak information has been lost, for example > to take an older PG text and recover linebreak information > in order to allow, for example, the resubmission of that > PG text back to DP for a clean-up pass. again, not something that hasn't been seen before... but i'd love to see this in action. carlo has _posted_ that people could use wdiff to do this chore automatically, but when asked to explain the procedure, he failed to follow up. > In normal mode when if finds an mismatch it outputs > the mismatch like this { it?ll | it?11 } within the body of > the text so that given a regex compatible editor it is > very quick to search for and fix the errors found. i'd really like to learn the reg-ex that makes this "very quick". i assume you'd search for the first half of the pair, and erase it if it's incorrect. then you'd do the same for the second half. then you'd go back and globally remove the excess characters. but i'd sure like to see that in action. and i don't think it would be very fast. or feel very easy. especially when -- for an error like '11 -- a global change within each of the files would end up being more efficient. it's also the case that, as i mentioned up above, you _need_ to have the scan available for viewing to resolve some diffs, so the ability of the tool to present those scans is _crucial_. > I find that finding differences on a word basis rather than > a line basis makes it quicker and easier to fix the errors if you've looked at the diffs i've presented, the _indicator_line_ narrows your focus down to a single word (if that's the diff), or even a single _character_ (like a comma, if that's the diff). it's just showing you the entire line so you have the _context_, and so you can _find_that_line_ more easily on the page-scan. > Source and a compiled windows version at i'll take a look, as soon as i happen to be around a windows box. in the meantime, congratulations for programming a tool! :+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From gbuchana at teksavvy.com Thu Mar 11 19:33:03 2010 From: gbuchana at teksavvy.com (Gardner Buchanan) Date: Thu, 11 Mar 2010 22:33:03 -0500 Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: References: <379bb.624c0680.38c837db@aol.com> Message-ID: <4B99B5EF.8000100@teksavvy.com> On 11-Mar-2010 19:09, James Adcock wrote: > PS: To help clarify what I am talking about I enclose below an except of > the output of this tool > This suits me. I have a project on the go that I will try this on pretty promptly. I will let you know what I come up with. Thank you! ============================================================ Gardner Buchanan Ottawa, ON FreeBSD: Where you want to go. Today. From traverso at posso.dm.unipi.it Thu Mar 11 20:30:57 2010 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Fri, 12 Mar 2010 05:30:57 +0100 (CET) Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: (jimad@msn.com) References: <379bb.624c0680.38c837db@aol.com> Message-ID: <20100312043057.B1BDBFFC5@cardano.dm.unipi.it> >>>>> "James" == James Adcock writes: James> PS: To help clarify what I am talking about I enclose below James> an except of the output of this tool James> (being used for versioning, error-flagging and linebreak James> recovery) It seems very much similar to wdiff output, may you please show where your tool gives something basically different from wdiff? Carlo Traverso From ke at gnu.franken.de Thu Mar 11 23:13:00 2010 From: ke at gnu.franken.de (Karl Eichwalder) Date: Fri, 12 Mar 2010 08:13:00 +0100 Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: <20100312043057.B1BDBFFC5@cardano.dm.unipi.it> (Carlo Traverso's message of "Fri, 12 Mar 2010 05:30:57 +0100 (CET)") References: <379bb.624c0680.38c837db@aol.com> <20100312043057.B1BDBFFC5@cardano.dm.unipi.it> Message-ID: traverso at posso.dm.unipi.it (Carlo Traverso) writes: > It seems very much similar to wdiff output, may you please show where > your tool gives something basically different from wdiff? Or ediff, coming with Emacs. -- Karl Eichwalder From Bowerbird at aol.com Fri Mar 12 11:05:14 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 12 Mar 2010 14:05:14 EST Subject: [gutvol-d] any arguments against automatic good-word listing? Message-ID: <1da4d.5875f551.38cbea6a@aol.com> i recently praised rfrank for simplifying the procedure of adding a word to the "good-words list", compared to d.p. when a word which _should_ be on that list is missing, proofers have to struggle through the false flagging of that word, and that decreases the efficiency of flagging. for instance, when the name of a character in the book is flagged every time it appears, it can cause "flag fatigue", by virtue of its frequent nature. moreover, it can cause you to miss the cases where the name was misrecognized, because you've grown accustomed to skipping that flag... (if the name is in the good-words list, only an _incorrect_ version of the name gets flagged, which is what we want.) and once your good-words list is _complete_, you can do spellcheck on the book and have it come out _clean_. this is extremely valuable, because it means that you can repeat that spellcheck after any major editing operation (or at any milestone that you decide during the workflow) to make sure that your processing didn't introduce errors. so it's in everyone's best interest to have a good-words list which actually contains all "good words" in the book. which is why rfrank's simple-and-immediate procedure is far superior to the d.p. way, which is hard and slow... but there are methods even better than rfrank's... one of the most useful tools in my arsenal is one that takes text as its input -- up to the entirety of a book -- and quickly spits out a list of words not in its dictionary. thus it gives me a list of words that i'll need to check... but _many_ of these words, primarily _names_, but also words that appear a relatively large number of times, are ones that will go onto the good-words list for the book. so this tool can be used in _preprocessing_ to create a good-words list that is actually compellingly complete. but let's say you did not use this tool in preprocessing, and your good-words list is still missing lots of words... now let's look at a case where a proofer has just finished a page that had a number of words flagged on it, because they were not in the dictionary or on the good-words list. even if the proofer didn't take the time to add the flagged words to the good-words list, should they be auto-added? because, if they're ok on this page, they should be added! in other words, why make the proofer go to _any_ trouble to add a word to the good-words list? just analyze the page they've saved, as "good", finding all words that are not on the good-words list, and adding them automatically? as far as i can see, there are 2 problems that might result. the first is that the proofer made an error, and failed to catch a flagged word that was incorrect. in such a case, that would mean that any other occurrences of that word would not be flagged. that's unfortunate, of course, but is it a great tragedy? i think not. proofers need to know that an unflagged word _might_ be incorrect, n'est pas? of course they do. that's the essence of a stealth scanno. the second problem is that the word might be correct on _this_ page, but is _incorrect_ if it appeared elsewhere, and would need to be flagged there. again, i do not take a failure to flag every bad word as being necessarily bad. but directly to the point of this second possible problem, i simply don't think this situation occurs all that often... so i would like to issue a challenge, to the people who look at more books-in-progress than i do, to _locate_ this situation, where a non-dictionary word is _correct_ on _one_ page, yet _incorrect_ on some _other_ page... after all, this is the raison d'etre for the d.p. procedure, which insists that a word nominated for the good-words list must be inspected/approved by the project manager. so -- if this situation is more common than i believe -- project managers should have _lots_ of examples for me. so let's hear them. anyway, that's the suggestion, that when a page is proofed, all the words which had been flagged are _automatically_ added to the good-words list. and yes, i will also add that i believe such additions should be _screened_ by someone, but that's in keeping with my overall plan that any and all changes that are made will be doublechecked by someone. -bowerbird p.s. if you were quick enough to grok a flip-side suggestion that any word on the good-words list which was _changed_ on any page should automatically be _removed_ from the good-words list, give yourself a gold star for a sharp mind. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Fri Mar 12 15:58:40 2010 From: jimad at msn.com (James Adcock) Date: Fri, 12 Mar 2010 15:58:40 -0800 Subject: [gutvol-d] Re: [SPAM] re: New Tool "pgdiff" In-Reply-To: <8ee0f.65afe1b7.38cb011b@aol.com> References: <8ee0f.65afe1b7.38cb011b@aol.com> Message-ID: >giving people a tool that would present the choices to them, and let them click a button for the correct one, would make this far more easier to work with. Sorry ? I guess I assumed people know how to use a regex editor! -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Fri Mar 12 20:53:57 2010 From: dakretz at gmail.com (don kretz) Date: Fri, 12 Mar 2010 20:53:57 -0800 Subject: [gutvol-d] Re: [SPAM] re: New Tool "pgdiff" In-Reply-To: References: <8ee0f.65afe1b7.38cb011b@aol.com> Message-ID: <627d59b81003122053p28ca57c5wef6a9f618c81ef0d@mail.gmail.com> The good news is probably all of them who do and also have any interest in proofreading are probably right here on this mailing ist. On Fri, Mar 12, 2010 at 3:58 PM, James Adcock wrote: > >giving people a tool that would present the choices to them, > and let them click a button for the correct one, would make > this far more easier to work with. > > > > Sorry ? I guess I assumed people know how to use a regex editor! > > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Mar 12 22:08:56 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 13 Mar 2010 01:08:56 EST Subject: [gutvol-d] Re: New Tool "pgdiff" Message-ID: <43aab.ad1a0fd.38cc85f8@aol.com> jim said: > Sorry ? I guess I assumed people know how to use a regex editor! um, you just shot yourself in the foot, jim. you assume that people know how to use a reg-ex editor, but you also assume they need to have wdiff wrapped in an .exe? not much logic there, i'm afraid... but hey, for the sake of completeness of the thread, how about you quickly run through how i'd "use a reg-ex editor" for this? because i honestly don't know. (and i _do_ know how to wdiff.) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From lee at novomail.net Sat Mar 13 11:34:49 2010 From: lee at novomail.net (Lee Passey) Date: Sat, 13 Mar 2010 12:34:49 -0700 Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: References: <379bb.624c0680.38c837db@aol.com> Message-ID: <4B9BE8D9.5020904@novomail.net> On 3/11/2010 4:24 PM, James Adcock wrote: > I have created a new command line tool ?pgdiff? along the lines of what > BB has been talking about, which compares two independently OCR?ed texts > on a word-by-word basis, so as to find and flag errors. [snip] I think this will be a very useful tool moving forward, at least to me. I particularly like the fact that the code is not derived from the GNU diff program. wdiff, of which Mr. Traverso is so fond, is actually just a front end to diff; it takes the input files and rewrites them so that each word is on a separate line, and then passes the rewritten lines to diff. Once you have the diff output it somehow figures out how to merge the results back with the originals, but I actually lost interest in figuring out the code when I realized in required the GNU diff program to work. One of the reasons I wanted to avoid GNU diff and wdiff is because of the restrictive, viral GPL. I have no problem /using/ GPLed programs, but I have no interest in extending or improving them -- which leads me to wonder about your own claims to intellectual property in this code. Here in the United States I don't think any author can avoid a copyright even if he or she doesn't want one. Copyright is created and attached by operation of law, and there is no actual legal entity called "the public domain" that you can assign your copyright to. I think it would be nice to have a non-profit organization whose mission is solely to hold copyrights and refuse to enforce them. In the meantime, here is the verbiage I use on my code; I'm not completely convinced it will actually work, but you might want to adopt it as well: /* Copyright-Only Dedication (based on United States law) The person or persons who have associated their work with this document (the "Dedicators") hereby dedicate whatever copyright they may have in the work of authorship herein (the "Work") to the public domain. Dedicators make this dedication for the benefit of the public at large and to the detriment of Dedicators' heirs and successors. Dedicators intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights under copyright law, whether vested or contingent, in the Work. Dedicators understand that such relinquishment of all rights includes the relinquishment of all rights to enforce (by lawsuit or otherwise) those copyrights in the Work. Dedicators recognize that, once placed in the public domain, the Work may be freely reproduced, distributed, transmitted, used, modified, built upon, or otherwise exploited by anyone for any purpose, commercial or non-commercial, and in any way, including by methods that have not yet been invented or conceived. */ I suspect that your own code may need to be "hardened" against particularly ill-formed files, and might possibly be enhanced to satisfy other needs, or could even become the back end for a visual tool for those users who need it. I'd be happy to route enhancements or bug fixes back to you if I have permission to use the code in other ways. From jon.ingram at gmail.com Sat Mar 13 12:06:31 2010 From: jon.ingram at gmail.com (Jon Ingram) Date: Sat, 13 Mar 2010 20:06:31 +0000 Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: <4B9BE8D9.5020904@novomail.net> References: <379bb.624c0680.38c837db@aol.com> <4B9BE8D9.5020904@novomail.net> Message-ID: <4baf53721003131206gde7e5cl761bb0f8706adc32@mail.gmail.com> On 13 March 2010 19:34, Lee Passey wrote: > > In the meantime, here is the verbiage I use on my code; I'm not completely > convinced it will actually work, but you might want to adopt it as well: > > /* > Copyright-Only Dedication (based on United States law) > > The person or persons who have associated their work with this > document (the "Dedicators") hereby dedicate whatever copyright they > may have in the work of authorship herein (the "Work") to the > public domain. > > Dedicators make this dedication for the benefit of the public at > large and to the detriment of Dedicators' heirs and successors. > Dedicators intend this dedication to be an overt act of > relinquishment in perpetuity of all present and future rights > under copyright law, whether vested or contingent, in the Work. > Dedicators understand that such relinquishment of all rights > includes the relinquishment of all rights to enforce (by lawsuit > or otherwise) those copyrights in the Work. > > Dedicators recognize that, once placed in the public domain, the > Work may be freely reproduced, distributed, transmitted, used, > modified, built upon, or otherwise exploited by anyone for any > purpose, commercial or non-commercial, and in any way, including > by methods that have not yet been invented or conceived. > */ > This sounds quite similar to the 'Creative Commons Zero' licence: http://creativecommons.org/publicdomain/zero/1.0/ -- Jon Ingram -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Sat Mar 13 17:51:20 2010 From: jimad at msn.com (James Adcock) Date: Sat, 13 Mar 2010 17:51:20 -0800 Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: <43aab.ad1a0fd.38cc85f8@aol.com> References: <43aab.ad1a0fd.38cc85f8@aol.com> Message-ID: >but hey, for the sake of completeness of the thread, how about you quickly run through how i'd "use a reg-ex editor" for this? because i honestly don't know. (and i _do_ know how to wdiff.) On Vim I type: :/[{|}]/ Which highlights the edits and takes me to the next set of edits to choose from, thereafter I just type ?n? to move to the next set of fixes that I need to deal with. I like Vim because I can just keep my fingers on the keyboard where they belong while editing and not have to mess with the mouse. When you are versioning it is frequently not as simple as ?choose A? or ?choose B? but often a mix of both that you have to edit. And I like seeing each next to each other in context to help figure out what the ?correct? editing moves are. IE if A is the target then maybe it has a word with a scanno, and B has the word without the scanno but with an incorrect capitalization. For example if B is an old PG text you are versioning then it may have ?italics? in ALL CAPS whereas A had it in real italics which increases the chance of the A OCR making a scanno. Also these are often the edits are a mixture of inserts, deletes, and substitutions. PS: You criticize me for doing that which the creator of wdiff said he would do if only he had the gumption. PPS: How do you use wdiff to recover lost linebreaks? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Sat Mar 13 17:58:41 2010 From: jimad at msn.com (James Adcock) Date: Sat, 13 Mar 2010 17:58:41 -0800 Subject: [gutvol-d] [SPAM] RE: Re: New Tool "pgdiff" In-Reply-To: <4B9BE8D9.5020904@novomail.net> References: <379bb.624c0680.38c837db@aol.com> <4B9BE8D9.5020904@novomail.net> Message-ID: I decline to attach any verbiage at all. I tell you I wrote it and you can use it any way you like -- at your own risk and amusement, obviously. If you need to get more serious than that contact me by email and we can talk about it. If you find bugs in it or difficulties porting to other platforms I would like to know about it. I recommend the code not be used by NASA. I have written code that others potentially depended on for life and limb and I would rather not have to go there again. From Bowerbird at aol.com Sun Mar 14 14:42:38 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 14 Mar 2010 17:42:38 EDT Subject: [gutvol-d] [SPAM] re: any arguments against "free-range" proofing? Message-ID: nobody came up with many objections to "free-range" proofing... *** keith said: > I do not see anything truely speaking against such a system. > The only problems are the administrative tasks involved. thanks for the feedback keith... > 1) you have to track all this. i think that's pretty easy. when they hit the "save/confirm" button, if there were changes made, then they have _saved_ a new version, which will then set up a "diff appointment" with any prior proofers. or, if there were no changes made, it's registered as a "certification". once a page has two consecutive certifications, it's marked as "done". (any free-range proofers can still proof the page, of course, but once all the pages are marked as "done", the book is ready to be finished.) the "diff appointment" can be resolved in one of three different ways. the first proofer can say "i goofed", or the second one can say "oops!" either of these actions mean one person loses points, and one gains... the third resolution -- when they can't come to a mutual agreement -- comes via a referee, who decides a winner and rewrites documentation so the issue doesn't come up again. points might or might not be lost. (or deducted points might be doubled, if a ref was called unnecessarily.) the purpose of diff review is simply to train correct proofing and coding. people who continue making bad changes might be asked to leave, but i don't anticipate it happening very often. people like to do a good job. > 2) keep everything store somewhere each subsequent saved-text will be stored, for subsequent diff tests. upon saving, it will be compared to all of the earlier saved versions, to see if it is a revert to an earlier save. if it is, it will be dealt with appropriately, depending on the resolution of that earlier version... any proofer will be able to step through all the versions of each file, viewing which changes were made. i expect that some proofers will specialize in this particular tactic, making sure every change is good. > 3) keep everything in sync i've had my share of sync problems in the past, coding e-book authoring-tools, so i think i know where all the pitfalls are now. :+) which is not to say i won't fall in some of 'em again sometimes. but i can usually figure out pretty quickly now what i did wrong. > you will need an authority/ies that finally certify that > a page satisfies your criteria as being done.? that's easy. if the text is .zml that creates .html which looks like the page-scan, then that satisfies the criteria. the proofers view .html output, so can see for themselves. > Some may call it a administrative nightmare, > but it should be workable. yes, i think i can make it work. again, thanks for the feedback. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sun Mar 14 15:58:29 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 14 Mar 2010 18:58:29 EDT Subject: [gutvol-d] Re: New Tool "pgdiff" Message-ID: <103fd.1a43a953.38cec415@aol.com> jim said: > I decline to attach any verbiage at all. > I tell you I wrote it and you can use it any way you like > -- at your own risk and amusement, obviously. except that some of that "verbiage" was people asking just how exactly your program differs from one that they've been using all along. don't you wanna tell 'em? and as for me, perhaps you noticed i congratulated you for programming a tool. was that just "verbiage" to you? in addition, i will analyze any new tool, to check how well it performs the job for which it is intended. it's fine if you don't want to discuss it, but such a review is not "verbiage". it's necessary to take an objective look at our tools to see if they do the job, how they can do it better, and so on... you specifically said your tool helps in 3 areas: 1. line-break recovery 2. error-flagging 3. versioning you even said: > my new tool has several tricks > that haven?t been seen before (if anything has been "verbiage" in this thread, it's that!) so, at the end of this post, i'll begin to look at those 3 areas. > If you need to get more serious than that > contact me by email and we can talk about it. imagine the d.p. people had told you to make your complaints "via e-mail". i'd venture a guess that you would laugh at that... *** > On Vim I type: > :/[{|}]/ > Which highlights the edits and takes me to the next set of edits but that selects both the options, and the surrounding characters. that's not really what you want -- what _most_people_ would want. and it involves typing. either typing or a lot of delicate deleting. both of which increase the probability that errors are introduced. > When you are versioning it is frequently not as simple as > ?choose A? or ?choose B? but often a mix of both that > you have to edit. i'm sure i know the reality much better than you do, jim, because i've actually _done_ this resolution job, for lots and lots of books. but maybe rather than schooling me personally, you've said this for the benefit of the lurkers who might not have thought about it very much, if at all. (and that's an entirely appropriate thing to do.) but if we're going to enlighten them, let's do it properly, ok? your word "frequently" is simply (but completely) out of place. in the vast majority of cases (96%) where there is a difference between the two versions, _one_ of the versions is _correct_... there _are_ some cases where both are incorrect, meaning that you need to do some editing, but such cases are relatively rare. in the last book for which i did a comparison, gardner's text, there were 159 differences. there were only _3_ cases where _both_ versions were incorrect. so yes, it happens, but rarely. > And I like seeing each next to each other in context > to help figure out what the ?correct? editing moves are. oh yeah, the context is _crucial_. but i'm not sure that your _display_ is the optimal one... it takes a lot of visual parsing to figure out a diff like this: > no way. There { warn't | wam't } a window to it big enough personally, i find this display _much_ easier to understand: > no way. There warn't a window to it big enough > no way. There wam't a window to it big enough > ================^^============================ (i hope the monospaced font came through. if so, you'll see the "^^" markers line up with the diff.) and i believe most users would agree that this display is better. but, you know, if some users like _your_ display better, _fine!_ :+) oh, and one more note on "context". sometimes it can fool you. the choice that looks right might not be what was in the book... that's why it's vitally important that your tool show you the scan. otherwise, you're doing your edits blind... > PS: You criticize me for doing that which the creator of wdiff > said he would do if only he had the gumption. you'll need to provide a little more information to be understood. > How do you use wdiff to recover lost linebreaks? i don't use wdiff for that. i wrote my own program. i asked carlo to explain how _he_ does it, but he never answered. i found it humorous he was willing to come out to challenge you, but isn't willing to come out when he is challenged... *** anyway, in order to "kick the tires" on your pgdiff program, jim, i'll set up some files that we can compare. (real books, real files, and not of my own choosing, either, but from rfrank's test-site.) i'll run the files when i next find myself around a p.c. machine... or, if you feel like it, jim, you can run them and post your output. once i have some real output to look at, i'll be able to do a much more thorough review of this new tool. while you're waiting for that, though, here's a screenshot of a tool that i wrote that makes it easier to work with jim's output. > http://z-m-l.com/misc/jim-tool-addon-screenshot.png basically, when it finds a line with a diff in it, it presents the options to the user, who can then click a button to choose one, or enter a number -- 1 or 2 -- to activate the appropriate button. in the case where editing is needed, either option can be edited before the button is clicked to select it. the "stop loop" button will stop the loop that presents the next diff display; otherwise, the app loops through the entire file, jumping to the next diff. so, you see jim, i'm really trying to _help_ you in your quest here. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Mon Mar 15 00:40:10 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Mon, 15 Mar 2010 08:40:10 +0100 Subject: [gutvol-d] Re: [SPAM] re: any arguments against "free-range" proofing? In-Reply-To: References: Message-ID: <16186B3D-DA3B-44E8-9B60-4435264D4779@uni-trier.de> Hi BB, You are basically, taking the standard approach to the problem. You did not need to explain. More interesting would be how you track everything. I believe you will need some form of a database for the points system, who made what version, when, is a page done, etc.. I wish you luck. Looks promising. regards Keith. Referees will have to special rights for changing it. Am 14.03.2010 um 22:42 schrieb Bowerbird at aol.com: > nobody came up with many objections to "free-range" proofing... > > *** > > keith said: > > I do not see anything truely speaking against such a system. > > The only problems are the administrative tasks involved. > > thanks for the feedback keith... > > > > 1) you have to track all this. > > i think that's pretty easy. when they hit the "save/confirm" button, > if there were changes made, then they have _saved_ a new version, > which will then set up a "diff appointment" with any prior proofers. > > or, if there were no changes made, it's registered as a "certification". > once a page has two consecutive certifications, it's marked as "done". > (any free-range proofers can still proof the page, of course, but once > all the pages are marked as "done", the book is ready to be finished.) > > the "diff appointment" can be resolved in one of three different ways. > the first proofer can say "i goofed", or the second one can say "oops!" > either of these actions mean one person loses points, and one gains... > > the third resolution -- when they can't come to a mutual agreement -- > comes via a referee, who decides a winner and rewrites documentation > so the issue doesn't come up again. points might or might not be lost. > (or deducted points might be doubled, if a ref was called unnecessarily.) > > the purpose of diff review is simply to train correct proofing and coding. > people who continue making bad changes might be asked to leave, but > i don't anticipate it happening very often. people like to do a good job. > > > > 2) keep everything store somewhere > > each subsequent saved-text will be stored, for subsequent diff tests. > > upon saving, it will be compared to all of the earlier saved versions, > to see if it is a revert to an earlier save. if it is, it will be dealt with > appropriately, depending on the resolution of that earlier version... > > any proofer will be able to step through all the versions of each file, > viewing which changes were made. i expect that some proofers will > specialize in this particular tactic, making sure every change is good. > > > > 3) keep everything in sync > > i've had my share of sync problems in the past, coding e-book > authoring-tools, so i think i know where all the pitfalls are now. :+) > > which is not to say i won't fall in some of 'em again sometimes. > but i can usually figure out pretty quickly now what i did wrong. > > > > you will need an authority/ies that finally certify that > > a page satisfies your criteria as being done. > > that's easy. if the text is .zml that creates .html which > looks like the page-scan, then that satisfies the criteria. > the proofers view .html output, so can see for themselves. > > > > Some may call it a administrative nightmare, > > but it should be workable. > > yes, i think i can make it work. again, thanks for the feedback. > > -bowerbird > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Mon Mar 15 01:12:47 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Mon, 15 Mar 2010 09:12:47 +0100 Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: <103fd.1a43a953.38cec415@aol.com> References: <103fd.1a43a953.38cec415@aol.com> Message-ID: <8949C23D-FED3-47F8-AB1F-5629CC201107@uni-trier.de> Am 14.03.2010 um 23:58 schrieb Bowerbird at aol.com: > jim said: > > I decline to attach any verbiage at all. > > I tell you I wrote it and you can use it any way you like > > -- at your own risk and amusement, obviously. [snip, snip] > > your word "frequently" is simply (but completely) out of place. > > in the vast majority of cases (96%) where there is a difference > between the two versions, _one_ of the versions is _correct_... > there _are_ some cases where both are incorrect, meaning that > you need to do some editing, but such cases are relatively rare. > > in the last book for which i did a comparison, gardner's text, > there were 159 differences. there were only _3_ cases where > _both_ versions were incorrect. so yes, it happens, but rarely. True enough. Yet, the arguement stands. At least in my opinion. The trivial cases are easy to handle, yet it is always the RARE cases where tools can shine and set themselves apart from the rest. > > > > And I like seeing each next to each other in context > > to help figure out what the ?correct? editing moves are. > > oh yeah, the context is _crucial_. > > but i'm not sure that your _display_ is the optimal one... > it takes a lot of visual parsing to figure out a diff like this: > > no way. There { warn't | wam't } a window to it big enough > > personally, i find this display _much_ easier to understand: > > no way. There warn't a window to it big enough > > no way. There wam't a window to it big enough > > ================^^============================ > > (i hope the monospaced font came through. if so, > you'll see the "^^" markers line up with the diff.) > > and i believe most users would agree that this display is better. > > but, you know, if some users like _your_ display better, _fine!_ :+) Actually, both methods are kind of primitive from a Human Interface standpoint. a better way would be having two windows containing two or more lines above and below the diff and marking each. If you ever work with critical editions you will understand the cavet of this method. The changes can then be made in a third. All can be enhanced with colors and other neat features. > > oh, and one more note on "context". sometimes it can fool you. > the choice that looks right might not be what was in the book... > that's why it's vitally important that your tool show you the scan. > otherwise, you're doing your edits blind... Very true. regards Keith. P.S. There will always more than one way to skin a cat! -------------- next part -------------- An HTML attachment was scrubbed... URL: From danweber at mindspring.com Mon Mar 15 09:29:27 2010 From: danweber at mindspring.com (Dan Weber) Date: Mon, 15 Mar 2010 12:29:27 -0400 Subject: [gutvol-d] (no subject) Message-ID: <003301cac45c$af100b20$0d302160$@com> To whom it may concern: www.popsci.com This site has 137 years of Popular Science magazine page scans online for free. danweber at mindspring.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Mar 15 12:40:47 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 15 Mar 2010 15:40:47 EDT Subject: [gutvol-d] Re: New Tool "pgdiff" Message-ID: <5bb98.38a53811.38cfe73f@aol.com> keith said: > True enough. Yet, the arguement stands. perhaps you didn't catch my entire gist. _of_course_ one needs to allow for the possibility of editing either version, since both might be incorrect. as i pointed out, my tool (which supports jim's tool) does exactly that. > At least in my opinion. The trivial cases are > easy to handle, yet it is always the RARE cases > where tools can shine and set themselves > apart from the rest. but jim's methodology -- where his tool simply _marks_ the differences after which a person uses a reg-ex editor to actually _make_the_changes_ -- handles neither the rare cases nor the trivial well. whereas my tool-in-support-of-his handles both, equally well. a reg-ex editor, by requiring manual editing even in the "trivial" cases, handles neither trivial nor rare very well, in my opinion. my tool-in-support-of-his makes the "trivial" cases, which are by far the most common, trivial to handle, with a mere button-click or keypress. and the user only has to do manual editing in the rare case, where it simply cannot be avoided. > Actually, both methods are kind of primitive > from a Human Interface standpoint. i always appreciate it when someone analyzes my tools. so let's see what you have to say here, keith. > a better way would be having > two windows containing two or more lines > above and below the diff and marking each.? a little bit of context can help elucidate the difference. too much context can bury it, depending on the display. i'd have to see exactly what you mean in order to decide. in my-tool-in-support-of-jim's tool, the change-window is a movable modal, so people can simply look back at the main window if they need to see more than 1 line of context. (i could also put multiple content lines in the top box of the change-window, if feedback indicated people wanted them.) > If you ever work with critical editions > you will understand the cavet of this method. if it impossible for you to explain in words? > The changes can then be made in a third. again, not sure what you really mean here... > All can be enhanced with colors and other neat features. you can always "enhance" anything with "other neat features". the hard part is _coming_up_ with those "other neat features". *** we should remember that my tool-in-support-of-jim's tool isn't how _i_ would do the job. i was just trying to show how to make his tool work better. i've shown how i do the job... here's how i showed diffs with gardner's book, on 23 february: > http://z-m-l.com/go/gardn/gardn-hybrid6.html that laid out the entire book, with diffs in different colors... here's a simple reworking of that file, which i just posted: > http://z-m-l.com/go/gardn/gardn-hybrid7.html this version of the file lets you click a link to see each scan, and gives you radio-buttons where you can select the correct alternative for each diff. (or choose neither if both are wrong.) this is how i would approach this task with an _online_ thrust, working in a collaborative manner. but i'd probably prefer to do it with an _offline_ app instead, since that's more efficient. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Mar 15 14:09:58 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 15 Mar 2010 17:09:58 EDT Subject: [gutvol-d] [SPAM] re: New Tool "pgdiff" Message-ID: <6311e.21050e3b.38cffc26@aol.com> ok, jim, here's some sample files for your tool... i'm using the book "sitka" that rfrank used on his test-site. here's the original text uploaded by rfrank for his proofers: > http://z-m-l.com/go/jimad/sitka0-ocr.txt and here's the text after the proofers were done with it: > http://z-m-l.com/go/jimad/sitka1-pp.txt if you can run that through your tool and share its output, that would be great. or i'll do it, when i next encounter a windows box. :+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From gbnewby at pglaf.org Mon Mar 15 19:14:03 2010 From: gbnewby at pglaf.org (Greg Newby) Date: Mon, 15 Mar 2010 19:14:03 -0700 Subject: [gutvol-d] Newby/Hart at Illinois symposium April 15-16 Message-ID: <20100316021403.GA26102@pglaf.org> For those in the region, this might be of interest: http://50years.lis.illinois.edu/ PGLAF CEO Greg Newby will join PG founder Michael Hart at a symposium on the U. Illinois campus. Registration is free but limited. The panel with Michael & Greg is scheduled for Thursday April 15 from 1:30-3pm. -- Greg From pterandon at gmail.com Tue Mar 16 02:59:53 2010 From: pterandon at gmail.com (Greg M. Johnson) Date: Tue, 16 Mar 2010 05:59:53 -0400 Subject: [gutvol-d] Re: [SPAM] re: any arguments against "free-range" proofing? Message-ID: From: "Keith J. Schultz" > > More interesting would be how you track everything. I > believe you will need some form of a database for > the points system, Could points be *the* problem? I don't think "points" works well in performance metrics for either high-level professionals or in nonprofits (say, the parents of a Cub Scout den expected to contribute so much volunteer effort per year). For one, it creates the expectation that the reason one contributes to humanity is to get recognition for their effort on a piecemeal basis. That the person who puts in 40 hours a week of effort needs more praise than he or she who put in 35, 30, or 10 hours. Secondly, it creates false hierarchies where you don't allow someone into leadership until they've "reached a level". It's just my philosophical bias that the best nonprofits are not those with the best volunteer awards dinners. -- Greg M. Johnson http://pterandon.blogspot.com From lee at novomail.net Tue Mar 16 07:58:15 2010 From: lee at novomail.net (Lee Passey) Date: Tue, 16 Mar 2010 08:58:15 -0600 Subject: [gutvol-d] Co-operative proofreading Message-ID: <4B9F9C87.4080608@novomail.net> Inspired by Mr. Frank, Mr. Adcock, Ms. Miske and Mr. Morasch, I decided to try to implement my own vision of a co-operative proofreading process. Anyone wanting to watch me flail about can follow my work at www.ebookcooperative.com. Login as guest, no password. Apparently the engineers at Microsoft have not yet figured out how to implement CSS percentages, and I haven't had the time (or inclination) to build an Internet Explorer-aware implementation yet, so visitors would be advised to use a different browser. From dakretz at gmail.com Tue Mar 16 08:58:25 2010 From: dakretz at gmail.com (don kretz) Date: Tue, 16 Mar 2010 08:58:25 -0700 Subject: [gutvol-d] Re: Co-operative proofreading In-Reply-To: <4B9F9C87.4080608@novomail.net> References: <4B9F9C87.4080608@novomail.net> Message-ID: <627d59b81003160858s7dc274a5w88c29adedcd607d0@mail.gmail.com> Nice one! I can see we're going through the classic learning curve! Let a thousand flowers bloom! On Tue, Mar 16, 2010 at 7:58 AM, Lee Passey wrote: > Inspired by Mr. Frank, Mr. Adcock, Ms. Miske and Mr. Morasch, I decided to > try to implement my own vision of a co-operative proofreading process. > Anyone wanting to watch me flail about can follow my work at > www.ebookcooperative.com. Login as guest, no password. > > Apparently the engineers at Microsoft have not yet figured out how to > implement CSS percentages, and I haven't had the time (or inclination) to > build an Internet Explorer-aware implementation yet, so visitors would be > advised to use a different browser. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Tue Mar 16 12:06:49 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 16 Mar 2010 15:06:49 EDT Subject: [gutvol-d] Re: Co-operative proofreading Message-ID: dkretz said: > Nice one! > I can see we're going through the classic learning curve! > Let a thousand flowers bloom! you seem like a nice-enough fellow, don. so why are all those people over at d.p. throwing rocks at you? ;+) *** lee said: > Inspired by Mr. Frank, Mr. Adcock, Ms. Miske and Mr. Morasch who are all these people? hey lee, you left out dkretz. who has done more than anyone, including rfrank. don was instrumental in producing dp-canada. i have not talked about dp-canada because i was banned from it before it even started by one of the crazy people involved with it. but it seems to be limping along just fine, as far as i know, so if anyone wants to start a site, i'd advise you to look at dp-canada. or talk to don. no one has come nearly as close to me in inspiring the d.p. rock-throwers as don; he must be doing something right. > I decided to try to implement > my own vision of a co-operative proofreading process. good luck! i'd give you some feedback, but since you're in my spam folder and all, dialog would be stilted, so you're on your own, passey. it would be nice if a thousand flowers bloomed, but my bet is 999 will die on the vine. which might also be fine, i don't know. i just know i won't bother to keep count until there's a shakeout... but geez louise, even some people from d.p. itself now seem to be newly motivated to _do_something_. which would be a good sign, except they're planning a facelift to a system with structural damage. it might _look_ a little better, but it won't really _work_ any better. but -- largely due to rfrank's competition -- they're now _trying_... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jon.ingram at gmail.com Tue Mar 16 12:39:05 2010 From: jon.ingram at gmail.com (Jon Ingram) Date: Tue, 16 Mar 2010 19:39:05 +0000 Subject: [gutvol-d] Re: Co-operative proofreading In-Reply-To: <4B9F9C87.4080608@novomail.net> References: <4B9F9C87.4080608@novomail.net> Message-ID: <4baf53721003161239p7d1c5e81n7ca4163b5ffc0d87@mail.gmail.com> Interesting, and it's good to see someone using a rich text editor for the text, rather than expecting proofers to mess around with , etc. I'm not sure how the page was supposed to look, however -- - I'm using a widescreen 1680x1050 monitor, and there was still material off the bottom of the page. This is using Google Chrome. - I couldn't see any way to resize the image, so as to see the page width rather than the zoomed in image, which gives me about 4 words before I have to scroll to the right - I couldn't see any way to change the font in the text window, preferably to dpcustommono, which is ugly, but is the best font I've yet used for proofing. - It would be nice to have some instructions. What exactly are you expecting me to do to the page? Do you want headers/footers/page numbers to be kept? Do you want end of line hyphens kept? Do you want paragraphs joined? - It would be nice to have (the option of) a horizontal rather than vertical layout. I used to really like the vertical layout, but found I was more accurate at proofing with a horizontal one. - I really prefer block paragraphs rather than indented ones for computer-based text. A very good implementation so far -- I'll await developments. On 16 March 2010 14:58, Lee Passey wrote: > Inspired by Mr. Frank, Mr. Adcock, Ms. Miske and Mr. Morasch, I decided to > try to implement my own vision of a co-operative proofreading process. > Anyone wanting to watch me flail about can follow my work at > www.ebookcooperative.com. Login as guest, no password. > > Apparently the engineers at Microsoft have not yet figured out how to > implement CSS percentages, and I haven't had the time (or inclination) to > build an Internet Explorer-aware implementation yet, so visitors would be > advised to use a different browser. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Tue Mar 16 12:50:53 2010 From: dakretz at gmail.com (don kretz) Date: Tue, 16 Mar 2010 12:50:53 -0700 Subject: [gutvol-d] Re: Co-operative proofreading In-Reply-To: References: Message-ID: <627d59b81003161250w56295732jf0724b4e7960b064@mail.gmail.com> Whan that Aprill, with his shoures soote The droghte of March hath perced to the roote And bathed every veyne in swich licour, Of which vertu engendred is the flour; Whan Zephirus eek with his sweete breeth Inspired hath in every holt and heeth The tendre croppes, and the yonge sonne Hath in the Ram his halfe cours yronne, And smale foweles maken melodye, That slepen al the nyght with open eye-- (So priketh hem Nature in hir corages); Thanne longen folk to write proofreading software. -------------- next part -------------- An HTML attachment was scrubbed... URL: From vze3rknp at verizon.net Tue Mar 16 13:57:47 2010 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Tue, 16 Mar 2010 16:57:47 -0400 Subject: [gutvol-d] Re: Co-operative proofreading In-Reply-To: <4baf53721003161239p7d1c5e81n7ca4163b5ffc0d87@mail.gmail.com> References: <4B9F9C87.4080608@novomail.net> <4baf53721003161239p7d1c5e81n7ca4163b5ffc0d87@mail.gmail.com> Message-ID: <4B9FF0CB.1070803@verizon.net> I tried it with Chrome and couldn't get the text box to let me edit. It worked fine in Firefox, except that I didn't find a way to save the work, aside from asking for another page and then saying yes when it asked if I wanted to save. Also, it insisted on indenting the line at the top of the page, even when it wasn't the beginning of the paragraph. I, too, find a horizontal interface works better for me. JulietS On 3/16/2010 3:39 PM, Jon Ingram wrote: > Interesting, and it's good to see someone using a rich text editor for > the text, rather than expecting proofers to mess around with , etc. > > I'm not sure how the page was supposed to look, however -- > > - I'm using a widescreen 1680x1050 monitor, and there was still > material off the bottom of the page. This is using Google Chrome. > > - I couldn't see any way to resize the image, so as to see the page > width rather than the zoomed in image, which gives me about 4 words > before I have to scroll to the right > > - I couldn't see any way to change the font in the text window, > preferably to dpcustommono, which is ugly, but is the best font I've > yet used for proofing. > > - It would be nice to have some instructions. What exactly are you > expecting me to do to the page? Do you want headers/footers/page > numbers to be kept? Do you want end of line hyphens kept? Do you want > paragraphs joined? > > - It would be nice to have (the option of) a horizontal rather than > vertical layout. I used to really like the vertical layout, but found > I was more accurate at proofing with a horizontal one. > > - I really prefer block paragraphs rather than indented ones for > computer-based text. > > A very good implementation so far -- I'll await developments. > > On 16 March 2010 14:58, Lee Passey > wrote: > > Inspired by Mr. Frank, Mr. Adcock, Ms. Miske and Mr. Morasch, I > decided to try to implement my own vision of a co-operative > proofreading process. Anyone wanting to watch me flail about can > follow my work at www.ebookcooperative.com > . Login as guest, no password. > > Apparently the engineers at Microsoft have not yet figured out how > to implement CSS percentages, and I haven't had the time (or > inclination) to build an Internet Explorer-aware implementation > yet, so visitors would be advised to use a different browser. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vze3rknp at verizon.net Tue Mar 16 14:50:18 2010 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Tue, 16 Mar 2010 17:50:18 -0400 Subject: [gutvol-d] Re: any arguments against "free-range" proofing? In-Reply-To: <36c3a.27b91b5.38c98aa2@aol.com> References: <36c3a.27b91b5.38c98aa2@aol.com> Message-ID: <4B9FFD1A.8000409@verizon.net> I do have some thoughts about "free-range" proofing. The size of the corpus that is being proofed is important. The Australian Newspaper project (http://newspapers.nla.gov.au/ndp/del/home) allows a volunteer to proof any article from ~100 yrs of lots of newspapers. They built it so that their readers could improve articles when they found errors. It works very well for that purpose, but it also has several problems. One is that they don't provide any information about whether or not someone has already proofed this article. The proofing interface is totally optional, so if a reader doesn't see any errors, then they don't invoke the interface. From that point of view, it works beautifully. But they didn't make provision for someone who just wants to proof an article, any article. There is no way to say "give me another article". I find it very hard to choose at random when the number of possibilities is so large. Also, since there's no information as to whether or not anyone has looked at (proofed) this article yet, there's no way to know if one is duplicating work already done. Another problem with their system is one of completeness. For example, if they want to know whether an entire issue of a newspaper (1 day) is completely corrected (or at least that someone has edited every article) they can't do it. Part of this can be solved by them keeping track of this information. But, by the nature of their system, with efforts scattered all over the place, it is very unlikely that any one issue will be completely done. For their purposes, that doesn't matter. But when working on things that are meant to be read from beginning to end, it *does* matter. All of this ties in to a sense of progress. If the unit of proofing produces a complete entity (as with an article in a newspaper) then one can count progress by counting how many articles have been done. But if the unit of proofing is not the complete entity (as with a page of a book), then matters change. The whole idea of distributing the work of proofreading is that no one has to feel like they must do an entire book by themselves. With the current systems, a volunteer knows that even if they can't do the entire book themselves, someone else will help out and it will get done. In a free-range system, there is no such assurance that anyone else will want to help finish that book. I guess what I'm saying is that people who proof for the sake of proofing like to see progress. To have a sense of accomplishment while knowing that they contributed. The only way I can see to achieve that in a free-range environment is by limiting the number of books that are currently available. That is, concentrating the work somehow so that eventually a book is completely "done" (or, as good as it's going to get for now). I think that there is a need for both kinds of systems. The free-range system is good for material that is short. It's also good for allowing casual readers to fix something that's wrong. I don't think it works very well as a system for producing entire corrected books. Another issue with a free-range system has to do with abuse. If no one is likely to look again at whatever page I've just done, there is nothing to keep me from changing what it says. Think of it as a kind of graffiti. The Australian Newspaper project hasn't had trouble with that, but I believe that that is because they haven't been going long enough and haven't attracted a wide enough audience yet. I predict that they will have trouble with it eventually. Most people are well-meaning, but there's always the few who have to write "John was here" on a wall, or in an online book. And there will inevitably be a few fanatics who just have to substitute their view of the world, either by carefully changing a few words, or by simply putting an entire tract in place of the text that used to be there. One advantage of many people looking at a single page (or, at least 2) is that it becomes hard to get away with that kind of thing. As long as the proofing effort is relatively small, and not very high profile, a free-range system would probably not have trouble with vandalism. But if the effort were associated with a high profile organization (Google, say) it suddenly it would become much more interesting to folks who like to disrupt. In summary, I think there are three issues that a free-range proofing system must address: choice, completeness, and vandalism. I'm not saying that a free-range system wouldn't work. It obviously can. But I do think that how well it works depends on what its purpose is. JulietS On 3/10/2010 6:52 PM, Bowerbird at aol.com wrote: > the d.p. proofing system locks each page to a single proofer. > (there's one and only one p1 proofer, p2 proofer, and so on.) > > so does rfrank's roundless system; once a page has been > assigned to a proofer, it's semi-difficult to even look at it. > > and if someone else has reproofed it _after_ that person, > then the old version is stored somewhere i can't figure out, > so tracking the diffs simply cannot be done by an outsider. > > (the d.p. system at least allows you to do that tracking, and > even has a routine that will show you round-to-round diffs.) > > it is by analyzing these round-to-round diffs very closely > that you can get a sense for how a page progresses from > the initial o.c.r. to its final -- hopefully perfect -- stage... > > *** > > the question i have today is whether there is a good reason > why a page needs to be assigned-and-locked to one person. > > is there any reason why you shouldn't allow any proofer to > go and proof any page in a book? yes, it would mean that > some pages might be proofed several times, but so what? > that's not necessarily a _bad_ thing, is it? -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Tue Mar 16 16:33:18 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 16 Mar 2010 19:33:18 EDT Subject: [gutvol-d] Re: Co-operative proofreading Message-ID: jon said: > Interesting, and it's good to see someone do people really think lee's system is "interesting"? unless i'm missing something, it's just a mockup? it doesn't actually save the text, not that i can see. and the reg-ex cleanup doesn't really work, does it? > Interesting, and it's good to see someone > using a rich text editor for the text, rather than > expecting proofers to mess around with , etc. except there is a tremendous conundrum at work... because we don't want proofers to go "presentational", do we? we want them to make structural distinctions... but with all those presentational w.y.s.i.w.y.g. buttons littering the interface, how would proofers ignore them? > I'm using a widescreen 1680x1050 monitor, and > there was still material off the bottom of the page. > This is using Google Chrome. i had similar problems in safari. camino worked fine. > It would be nice to have (the option of) a horizontal > rather than vertical layout. I used to really like > the vertical layout, but found I was more accurate > at proofing with a horizontal one. i don't understand the appeal of a horizontal interface -- way too much scrolling for me! -- but lots of people seem to prefer it. > I really prefer block paragraphs rather than > indented ones for computer-based text. but you want the display to look like the p-book, not? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Tue Mar 16 16:20:50 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 16 Mar 2010 19:20:50 EDT Subject: [gutvol-d] Re: any arguments against "free-range" proofing? Message-ID: juliet said: > The Australian Newspaper project allows a volunteer to > proof any article from ~100 yrs of lots of newspapers. well, that's quite different from what i'm talking about, which is to allow people to proof any page of a _book_, one that is being actively worked on at the present time. > Also, since there's no information as to whether or not > anyone has looked at (proofed) this article yet, there's > no way to know if one is duplicating work already done. again, quite different. i will explicitly inform people of the exact status of every page. however, if they _want_ to "duplicate" work that is "already done" -- by proofing a page that's already "finished" -- they can certainly do so. indeed, up through the 3rd "confirmation" a page is "done", the person would continue to receive "points" for doing so... the main reason a person would go the "free-range" route, i would think, would be so they could actually read the book in the process of proofing. i think that's a useful perspective. another reason might be to do a "specialized" look at the book. for instance, i think it'd be great for a person to look through the entire book just to find cases of _italics_ and _formatting_. even a pass checking the paragraph-starts at page-tops will be a useful quality-control mechanism i'd want to encourage. > All of this ties in to a sense of progress. indeed. > In a free-range system, there is no such assurance that > anyone else will want to help finish that book. i think it's just the opposite. if i inform people which pages haven't yet been proofed, many people would choose them. if i show people which pages need to be confirmed, i think some people will want to get their "points" that way instead. and other people will want to read straight through the book, without any regard for the state of any one particular page... by letting them choose whatever they like, rather than just _assigning_ them a page, with a choice to "take it or leave it", i think they're going to do a good job of progressing a book. > The only way I can see to achieve that > in a free-range environment is by > limiting the number of books that are currently available. which is, i think, a perfectly good way to achieve that goal. the idea that d.p. seems to have settled upon is that anyone can put a book in the system, without anybody knowing who -- if anyone -- will be there at the end to pick up the pieces. as a consequence, you now have two tons of half-done books. and this has nothing to do with "free-range", as evidenced by the fact that rfrank is using a "limited-number" approach in his system, in order to make sure that books don't get beached. > If no one is likely to look again at whatever page I've just done, > there is nothing to keep me from changing what it says. ok, i guess you're talking about the australian newspaper thing. which, again, has no applicability to what i'm talking about, so i won't bother to address it here, except to say that my system is _expressly_and_extensively_ geared to checking changes made. there won't be any "graffiti" that won't be painted over very quickly. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Wed Mar 17 01:35:44 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Wed, 17 Mar 2010 09:35:44 +0100 Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: <5bb98.38a53811.38cfe73f@aol.com> References: <5bb98.38a53811.38cfe73f@aol.com> Message-ID: <3930D74D-95CB-412E-B201-4E25BC5963D0@uni-trier.de> Hi BB, All, BB you wanted some clarification so I will try. First, I do not have the time to look closely at either tool nor overs, as it would take to long for me to properly analyze them, and due the analysis merit. So I will proppose a possible design. Take the good, leave the bad. Second the design is taken from tools used for creating critical editions. A critical edition is where two or more versions of a text are put side by side and commented on. It is used most on historical texts, translations, etc. Some of this is possibly overkill. In my opinion would be a tool with four windows/frames: 1) the scan 2) version 1 3) version 2 4) the proofed version Yes, BB, there is the problem of screen clutter and the possibility of offering to much information. Yet, a single line for 2 and 3 may be to little as the co-text may offer hints for a possible correct version. Then again we are proofing and since we can assume that a proofer has done this regularly s/he will easily adjust. The scan (1) would be optimally syncrhonised with the passage being checked. 2 and 3 should have at least three lines of text each. Generally, 5 would be better. Though for the purpose of proofing less might do! 4 would be contain the entire new proofed version but sync at first to the conflict being investigated. Now, we have three cases to consider: a) version 1 is correct b) version 2 is correct c) neither 1 or 2 is correct d) the case where 1 and 2 are correct is actually not possible in our context unless version 1 and 2 are comming from different edition. Still we can handle this in the same manner as c. For cases a and b you have a button to accept that version as correct. For case c we could simply fall back into the editor and let the proofer hand edit. Another possiblity would be to offer possible hints for a correction. These could come from: - spellchecker - list of changes already made in the text or entire scan set. The spell checker is trivial. The list is a can of worms by itself. Though it would be a compromise to your text wise change BB. That is we had this before and could be the case. When the proofer is done with the "diffs", fold up the windows of 1 and 2 expand windows/frames for the scan and corrected version check for other possible mistakes and save. Hope this helps. If not hit delete. regards Keith. Am 15.03.2010 um 20:40 schrieb Bowerbird at aol.com: > keith said: > > True enough. Yet, the arguement stands. > > perhaps you didn't catch my entire gist. > > _of_course_ one needs to allow for the possibility of > editing either version, since both might be incorrect. > > as i pointed out, my tool (which supports jim's tool) > does exactly that. > > > > At least in my opinion. The trivial cases are > > easy to handle, yet it is always the RARE cases > > where tools can shine and set themselves > > apart from the rest. [snip, snip] > > Actually, both methods are kind of primitive > > from a Human Interface standpoint. > > i always appreciate it when someone analyzes my tools. > > so let's see what you have to say here, keith. > > > > a better way would be having > > two windows containing two or more lines > > above and below the diff and marking each. > > a little bit of context can help elucidate the difference. > too much context can bury it, depending on the display. > i'd have to see exactly what you mean in order to decide. > > in my-tool-in-support-of-jim's tool, the change-window > is a movable modal, so people can simply look back at the > main window if they need to see more than 1 line of context. > > (i could also put multiple content lines in the top box of the > change-window, if feedback indicated people wanted them.) > > > > If you ever work with critical editions > > you will understand the cavet of this method. > > if it impossible for you to explain in words? > > > > The changes can then be made in a third. > > again, not sure what you really mean here... > > > > All can be enhanced with colors and other neat features. > > you can always "enhance" anything with "other neat features". > the hard part is _coming_up_ with those "other neat features". > > *** > > we should remember that my tool-in-support-of-jim's tool > isn't how _i_ would do the job. i was just trying to show how > to make his tool work better. i've shown how i do the job... > > here's how i showed diffs with gardner's book, on 23 february: > > http://z-m-l.com/go/gardn/gardn-hybrid6.html > > that laid out the entire book, with diffs in different colors... > > here's a simple reworking of that file, which i just posted: > > http://z-m-l.com/go/gardn/gardn-hybrid7.html > > this version of the file lets you click a link to see each scan, > and gives you radio-buttons where you can select the correct > alternative for each diff. (or choose neither if both are wrong.) > > this is how i would approach this task with an _online_ thrust, > working in a collaborative manner. but i'd probably prefer to > do it with an _offline_ app instead, since that's more efficient. > > -bowerbird > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Mar 17 09:35:53 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 17 Mar 2010 12:35:53 EDT Subject: [gutvol-d] Re: New Tool "pgdiff" Message-ID: keith said: > BB you wanted some clarification so I will try. um, ok, i guess. > First, I do not have the time to look closely at either > tool nor overs, as it would take to long for me to > properly analyze them, and due the analysis merit. um, ok, i guess. :+) > So I will proppose a possible design. > Take the good, leave the bad. um, ok, i guess. ;+) > In my opinion would be a tool with four windows/frames: > 1) the scan > 2) version 1 > 3) version 2 > 4) the proofed version ... > The scan (1) would be optimally syncrhonised > with the passage being checked. ... > Now, we have three cases to consider: > a) version 1 is correct > b) version 2 is correct > c) neither 1 or 2 is correct > d) the case where 1 and 2 are correct is > actually not possible in our context > unless version 1 and 2 are comming from different edition. > Still we can handle this in the same manner as c. ... > Hope this helps. If not hit delete. ok. now i'm curious... :+) keith, how do you think i pull off all the comparisons i do? how do you think i can sling around lists of diffs like i do? > http://z-m-l.com/go/gardn/gardn-hybrid6.html how do you think i can mount entire books with diffs? > http://z-m-l.com/go/gardn/gardn-hybrid7.html how do you think i resolve the diffs in all the books i do? i can tell you how i do it! i do it with tools i've programmed that do all the things that you talk about, and more. that's how i do it, keith. so you don't have to do hypothetical writeups, keith, especially if you're short on time, because i have a big batch of post-hypothetical reality sitting in my toolbox. it's not that your writeup doesn't "help". it's that we are past that point in time... and we have been, for a while. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From richfield at telkomsa.net Wed Mar 17 09:49:28 2010 From: richfield at telkomsa.net (Jon Richfield) Date: Wed, 17 Mar 2010 18:49:28 +0200 Subject: [gutvol-d] Re: (no subject) In-Reply-To: <003301cac45c$af100b20$0d302160$@com> References: <003301cac45c$af100b20$0d302160$@com> Message-ID: <4BA10818.4090209@telkomsa.net> Dan, yes it concerns me, but I cannot find those scans on that page. Should I be looking more intelligently? Cheers, Jon On 2010/03/15 18:29 PM, Dan Weber wrote: > > To whom it may concern: > > www.popsci.com > > This site has 137 years of Popular Science magazine page scans online > for free. > > danweber at mindspring.com > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Mar 17 11:44:02 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 17 Mar 2010 14:44:02 EDT Subject: [gutvol-d] Re: Co-operative proofreading Message-ID: <30421.2ce5a884.38d27cf2@aol.com> i said: > unless i'm missing something, it's just a mockup? while i don't think a mockup is very hard to do, not when compared to the programming per se, it's not like i'm mocking mockups... so if you're interested in mockups, there are more. here's one from dkretz: > http://www.pgdp.org/~dkretz/c/editpagedemo.php?projectid=projectID45c572a149feb&pagecode=70298&round=F1&taskcode=F1 here's a page giving several from cpeel: > http://www.pgdp.org/~cpeel/prototypes/ -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From lee at novomail.net Wed Mar 17 12:06:00 2010 From: lee at novomail.net (Lee Passey) Date: Wed, 17 Mar 2010 13:06:00 -0600 Subject: [gutvol-d] Re: Co-operative proofreading In-Reply-To: <4B9FF0CB.1070803@verizon.net> References: <4B9F9C87.4080608@novomail.net> <4baf53721003161239p7d1c5e81n7ca4163b5ffc0d87@mail.gmail.com> <4B9FF0CB.1070803@verizon.net> Message-ID: <4BA12818.70101@novomail.net> First of all, let me say that I am gratified by, and appreciative of those who have visited the site and offered feedback. Be aware that what you are seeing is more than a mock-up and less than a prototype; it is, in fact, my workbench. As my development process proceeds, I deploy software and files to that site for testing and evaluation. There is no guarantee that the behavior or appearance today will be the behavior or appearance tomorrow. What I intended was to provide a window into my development process. When I invited people to watch me flail about, that was exactly what I meant. On 3/16/2010 2:57 PM, Juliet Sutherland wrote: > I tried it with Chrome and couldn't get the text box to let me edit. > It worked fine in Firefox, except that I didn't find a way to save the > work, aside from asking for another page and then saying yes when it > asked if I wanted to save. The save button is the little "floppy disk" icon in the formatting toolbar, next to the "block type" drop down box. > Also, it insisted on indenting the line at > the top of the page, even when it wasn't the beginning of the paragraph. This is an artifact of the OCR process. To the best of my knowledge, no OCR program is capable of starting a page and recognizing that the text is, in fact, a continuation of the text on a previous page. As bowerbird has suggested, I have named my working files sequentially and in synchronization with the image files. My intent is to enhance my post-processing program a bit so that it will look at the first paragraph of one page together with the last paragraph on the preceding page. If the first does /not/ begin with a majuscule and the following does /not/ end with line terminating punctuation, I would mark the paragraph 'class="continuation".' The editor's CSS would not indent paragraphs of that class, and the merge program (which would create a single file of all the component files) would merge paragraphs when the class was encountered. Of course, this algorithm could create false positives where the OCR drops punctuation, or doesn't recognize capitalization, and create false negatives where sentences, but not paragraphs, begin on a new page. There will need to be a yet-to-be-determined method for the user interface to allow a proofreader to make this distinction. > I, too, find a horizontal interface works better for me. > > JulietS So before continuing, let me explain a little of my strategy and tactics. I am a firm believer in markup. Like Mr. Frank, I believe that the markup should be carried though with the text at every stage of the process. I am a firm believer in internet standards, even unofficial, de-facto internet standards. No re-inventing any wheels for me. Lastly, I am an extraordinarily lazy programmer. I'm not going to write any new code unless I absolutely have to. I will not, however, use any code infected by the Gnu Public License. Standalone programs are fine, but I won't touch GPL code with Mr. Haines' ten-foot pole. So... There is nothing any nearer to a standard for e-books than HTML. I decided that the original OCR should produce HTML output and that the markup should stick with the text until the final single file is created. Because of this decision, the final single file could be created over and over as small tweaks to the component files were made; there would be no need for any concept of finality or "doneness." I discovered that both the Plone and the Apache Lenya content management systems used a javascript-based visual HTML editor called Kupu. Kupu is now part of the Apache Lenya project and the source is available from apache.org under the Apache license. The editor you see at my website is Kupu, unmodified except for modifications to the CSS file that governs how it is displayed. I am assuming that at some point I will have to make slight modifications to the Kupu code, but that will be among the last things I do. I need to get the underlying workflow nailed down first. If there is anyone who wants to help out by tackling the Kupu interface (cough, Carel, cough) I would welcome the help. Your comments about the behavior of my site with Chrome makes me wonder how well Chrome is supported by the Apache Lenya project; maybe I should ping them to try it out. I needed a repository to track all the individual files for each project, and the changes thereto. Well, there's tons of software and applications that supports CVS, so CVS it is. The current plan is to have /three/ mostly-identical CVS repositories for each project. As registered users select a project each will be assigned to that repository which has been least-used. While the editor contents can be saved as many times as a user wants, when the user leaves a page (or after an appropriate timeout) the file will be committed to its repository. Upon commitment a file will be "diffed" against the other two repositories. When conflicts are found, a voting algorithm will resolve the conflict, if possible, and the changes will be committed to /all three/ repositories. The algorithm will not be a pure "two out of three," but will be weighted based on the number of users who have view a page. Hopefully, this kind of algorithm can minimize the problem of e-graffiti. If a "vote" is two close to call, both options will be placed in all the committed files in a manner similar to that proposed by Mr. Adcock. My biggest problem here is finding a "diff" application that can work as I need it to. Hmm, if I'm going to make users register and login, and if I'm going to track things like which repositories they have been assigned to, I'm going to need some kind of data store. My site uses the Apache web server, and has MySQL installed. Apache's authdb module can use MySQL as the authentication database. I guess all the data I need to track will be stored in MySQL (and I haven't even /started/ to think about how to define the tables I need). Now all I need is some glue to hold all the pieces together. I'm an accomplished Java programmer, familiar with JDBC and servlets. My site's server has Apache Tomcat installed and available. I guess that decision's a no-brainer. So, there's my strategy and some of the tactics to the extent I have worked them out. Now a few specific responses. > On 3/16/2010 3:39 PM, Jon Ingram wrote: >> Interesting, and it's good to see someone using a rich text editor for >> the text, rather than expecting proofers to mess around with , etc. As pointed out above, the editing window technically is not a rich text editor (which produces output in RTF format). It is the Kupu HTML editor, which I am still not very familiar with. But I agree that proofreaders need a tool where they can make the proofed text look like the scanned image. One of the things I like about Kupu is the little "scroll" button, which brings up a plain text editor where you /can/ edit the HTML source directly if you desire. I also need to add a method to add internal anchors, and a method to build tables of contents. >> I'm not sure how the page was supposed to look, however -- >> >> - I'm using a widescreen 1680x1050 monitor, and there was still >> material off the bottom of the page. This is using Google Chrome. It appears that either Chrome hasn't figured out how to use the CSS "percentage" value either, or perhaps it's understanding simply differs from that of Mozilla (from your description, I would guess the latter). I could go on at length about how /I/ think it should be implemented, but I won't. >> - I couldn't see any way to resize the image, so as to see the page >> width rather than the zoomed in image, which gives me about 4 words >> before I have to scroll to the right Resizing images is a problem. Right now, images are the size that FineReader exported them. Firefox autosizes the images into the constraining box, and provides a "zoom" function. Apparently Chrome does not have any sort of similar function (IE definitely does not), and Opera works even worse than IE. Maybe I can come up with an automated tool to resize the images into a set of standard resolutions (e.g. 25%, 50%, 75%, 100%). Then each user could individually set the preferences for the image size that works best for her or him. If the editor and image boxes are going to be fixed sizes perhaps I could add those parameters to a set of preferences as well. >> - I couldn't see any way to change the font in the text window, >> preferably to dpcustommono, which is ugly, but is the best font I've >> yet used for proofing. Well, you wouldn't want to set the font face or size for the file being saved, as that is a highly subjective matter. Unfortunately I have yet to see a browser that allows a user to override a page's stylesheet decision (although Opera is getting close). I've no experience yet with Chrome, does it do so? What I envision is allowing a user to select among a set of standard CSS style sheets as a sticky preference, or to actually upload his or her own for personal use. >> - It would be nice to have some instructions. What exactly are you >> expecting me to do to the page? Do you want headers/footers/page >> numbers to be kept? Do you want end of line hyphens kept? Do you want >> paragraphs joined? Guilty as charged. I'm thinking of adding a "Proofing Guidelines" button to each page, which would popup a separate window with those instructions. Of course, at this stage of development I have virtually no idea as to what those instructions would be, but it might be a good idea to add it now anyway, even if the instructions are as simple as "I know I have to add this in the future." >> - It would be nice to have (the option of) a horizontal rather than >> vertical layout. I used to really like the vertical layout, but found >> I was more accurate at proofing with a horizontal one. This could be handled by (yet another) user preferred stylesheet. >> - I really prefer block paragraphs rather than indented ones for >> computer-based text. A user preferred stylesheet could handle this issue as well, although if you were to do it you would need to figure out how to insert a visual signal when a paragraph is a "continuation" paragraph as opposed to a "real" paragraph. >> A very good implementation so far -- I'll await developments. Thank you. Just as a reminder, however, I suspect the user interface portion will not receive much attention until the latter stages of development; for now, I only need it to work well enough for me to test other parts of the workflow. From jimad at msn.com Wed Mar 17 12:09:59 2010 From: jimad at msn.com (Jim Adcock) Date: Wed, 17 Mar 2010 12:09:59 -0700 Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: <103fd.1a43a953.38cec415@aol.com> References: <103fd.1a43a953.38cec415@aol.com> Message-ID: >except that some of that "verbiage" was people asking just how exactly your program differs from one that they've been using all along. don't you wanna tell 'em? You are attacking my reply re attaching licensing terms by attaching it to unrelated discussions. >> On Vim I type: >> :/[{|}]/ >> Which highlights the edits and takes me to the next set of edits >but that selects both the options, and the surrounding characters. >that's not really what you want -- what _most_people_ would want. >and it involves typing. either typing or a lot of delicate deleting. >both of which increase the probability that errors are introduced. Again, you are assuming the problem presupposes a solution which is one of "Choose A" or "Choose B". If you use the tool on other than trivial problems you will find out that life is not that simple, and that frequently both A and B have some degree of errors that need to be corrected and/or merged to get you where you want to go. If one wanted to make a graphical tool to do this you would not only need the "Choose A" and "Choose B" options but "Edit in Context while displaying a copy of the original scanned page" and if one wants to make that kind of tool one would be better off to put the time and effort into figuring out a tool to display a scanned page a bit-mapped line at a time comparing to the OCR text as opposed to the DP current approach of displaying a bit-mapped page at a time compared to a OCR page at a time. And then one would also have to tackle the problem of how one wants to deal with the portability issues of the differing graphics systems on different people's computers. And one would have to build in an editing capability on par with the non-integrated editors that people currently choose to use and/or offer emulation of those editors in your editor offering. These WOULD be good issues to tackle, I just don't feel like I am the right person to tackle these problems. In practice, using pgdiff with Vim I find personally to be MUCH easier, less painful, and more productive than the DP approach, which is why I offer it for people to choose from. You still need to compare to the page scans. >in the vast majority of cases (96%) where there is a difference between the two versions, _one_ of the versions is _correct_... This is not my experience, but in any case it should be obvious that the results are HIGHLY dependent on what kind of texts and OCRs you are working on. >but, you know, if some users like _your_ display better, _fine!_ :+) More importantly, since I post my code and it is reasonably portable without a lot of rigmarole and without stack hacks like wdiff people can edit it and put it into their choice of display or other code. >you'll need to provide a little more information to be understood. Read the wdiff documentation and you will see the author admits he would have written a stand-alone tool that doesn't depend on diff if he could figure out the algorithm. > http://z-m-l.com/misc/jim-tool-addon-screenshot.png ... >so, you see jim, i'm really trying to _help_ you in your quest here. Thank you. Post a portable version or one compiled for windows and I will tell you how it works for me in practice. PS: doesn't really help me with *MY* quest since I have the tools *I* need to do my job the way I want to do it, but granted perhaps other people would be happier with the GUI approach you are suggesting. Since I post the source code they can apply my work however they want to. From richfield at telkomsa.net Wed Mar 17 12:54:00 2010 From: richfield at telkomsa.net (Jon Richfield) Date: Wed, 17 Mar 2010 21:54:00 +0200 Subject: [gutvol-d] Re:To whom it may concern: In-Reply-To: <003301cac45c$af100b20$0d302160$@com> References: <003301cac45c$af100b20$0d302160$@com> Message-ID: <4BA13358.9020207@telkomsa.net> Dan, yes it concerns me, but I cannot find those scans on that page. Should I be looking more intelligently? Cheers, Jon On 2010/03/15 18:29 PM, Dan Weber wrote: > > To whom it may concern: > > www.popsci.com > > This site has 137 years of Popular Science magazine page scans online > for free. > > danweber at mindspring.com > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: Attached Message Part URL: From Bowerbird at aol.com Wed Mar 17 13:12:20 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 17 Mar 2010 16:12:20 EDT Subject: [gutvol-d] Re: New Tool "pgdiff" Message-ID: <8329d.6343aa8a.38d291a4@aol.com> jim said: > You are attacking my reply re attaching licensing terms > by attaching it to unrelated discussions. jim, i'm not "attacking" anything. let's steer clear of antagonistic language. and if you had quoted what you were replying to, i would have _known_ the scope of your reply. don't blame me because i'm not a mind-reader. > Again, you are assuming the problem presupposes a solution > which is one of "Choose A" or "Choose B". well, since my tool lets a person _edit_ either choice a _or_ choice b, i think i've given the user plenty of leeway. > If you use the tool on other than trivial problems > you will find out that life is not that simple oh please, jim, wake up. i've actually _done_ the kind of comparisons and diff-resolutions that you're just _talking_about_. moreover, i have done them _many_times_. so i think i have much better _experience_ with the actual situation than you do. so don't try to "school me", ok? > frequently both A and B have some degree of errors that need > to be corrected and/or merged to get you where you want to go. now you're just repeating your "frequently" term, which i have already demonstrated to be false, using the most recent example i have shared. in all the resolutions i've done, the vast majority of diffs involved cases where _one_ of the versions was correct. very rarely were both wrong... > If one wanted to make a graphical tool to do this i not only _wanted_ to make such a tool, i actually _programmed_ it... > If one wanted to make a graphical tool to do this > you would not only need the "Choose A" and "Choose B" options > but "Edit in Context while displaying a copy of the original scanned page" you need to read what i wrote, jim. my tool gives people the option to edit either choice a or choice b, before selecting that particular choice. as for displaying the scan, does your pgdiff tool show that information? because if your tool can show that info, my tool can display the scan... but as far as i can see, from your limited example, you don't show that. (and when i say "my tool" in this post, i mean the tool that i wrote that _supports_ your tool. in the tools that i've programmed for _my_use_, i _always_ retain the page-scan information, so i can _display_ the scan.) > and if one wants to make that kind of tool one would be > better off to put the time and effort into figuring out a tool and here we wander off into the land of unnecessary complexity... > This is not my experience i've shared many of my experiences in doing the comparison method. if your experience doesn't match mine, you should share yours as well. > Post a portable version or one compiled for windows > and I will tell you how it works for me in practice.? i need to have some questions resolved before i put it out in public. that's why i furnished you some sample texts to get pgdiff output on. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Mar 17 14:17:53 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 17 Mar 2010 17:17:53 EDT Subject: [gutvol-d] a rose smells just as sweet Message-ID: <6c883.e6e09c0.38d2a101@aol.com> rfrank has informed people that he doesn't consider his site to be "competition" to the d.p. site... so from now on i'll characterize it instead as "an alternative" to d.p. it's all good. and a rose smells just as sweet... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Wed Mar 17 15:57:09 2010 From: jimad at msn.com (James Adcock) Date: Wed, 17 Mar 2010 15:57:09 -0700 Subject: [gutvol-d] Re: [SPAM] re: New Tool "pgdiff" In-Reply-To: <6311e.21050e3b.38cffc26@aol.com> References: <6311e.21050e3b.38cffc26@aol.com> Message-ID: >>here's the original text uploaded by rfrank for his proofers: >> http://z-m-l.com/go/jimad/sitka0-ocr.txt > >and here's the text after the proofers were done with it: >> http://z-m-l.com/go/jimad/sitka1-pp.txt > >if you can run that through your tool and share its output, >that would be great. OK, I put the output at: http://www.freekindlebooks.org/Dev/StringMatch/BBoutput.txt where I have changed the page separators on the two files to be identically named, because I am assuming you wouldn't want to find all the file name changes. Please note that the problem domain you are applying the tool to is not the same problem domain intended for the tool - so one shouldn't be surprised then if you consider the results in some sense "suboptimal." Even these "simple" outputs however, show how often it really isn't simply a problem of "Choose Word A" or "Choose Word B" but rather there a often lots of other issues involved at the same time, such as whitespace issues, punc issues, line break issues, etc, which complicate the design of the editor interface - assuming one *wants* to design a custom editor. Again, the problem this tool was designed to address was when you have two "independent" OCR outputs and you want to compare them to find those words or sections where a human being needs to perform an edit. Or for versioning. The results after human editing then would be expected to be about the quality of the output of a "P1" pass which then would have to be further carefully checked by more passes. And it is envisioned that even during the "P1" pass the editor is comparing to the page images. When applied to the problem domain envisioned you have at least 2X as many errors to deal with, and the resulting errors are more difficult than the ones in your example input files. Please see at: http://www.freekindlebooks.org/Dev/StringMatch/hkdiff.txt what I think is a more reasonable example of the kinds problems this tool is designed to address - here being used for versioning - an OCR from one edition of a text is being compared to an existing but old copy of a human-corrected PG text. On this example ideally a smart de-hyphenator ought to be run before making the comparison, but, its still interesting to see what happens when this isn't done. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Wed Mar 17 16:59:07 2010 From: jimad at msn.com (James Adcock) Date: Wed, 17 Mar 2010 16:59:07 -0700 Subject: [gutvol-d] [SPAM] RE: Re: [SPAM] re: New Tool "pgdiff" In-Reply-To: References: <6311e.21050e3b.38cffc26@aol.com> Message-ID: PS RE: http://www.freekindlebooks.org/Dev/StringMatch/hkdiff.txt Attempting to hand-score the kinds of edits one would need to do on hkdiff.txt, it seems to me like an intelligent editor could present "Choose A" vs. "Choose B" alternatives about 85% of the time, whereas the other 15% of the time a more complicated interface would have to be presented - or else the editor just punts and points to the text and says "You Fix It! (which is basically the approach my current choice of editor takes 100% of the time ;-) However, if the editor gives a "Choose A" vs. "Choose B" interface sometimes the editor (and/or the user) is going to be deceived because what looks like an A/B choice really ISN'T. For example a hypothetical example: .. one { must | MUST } be careful! And the correct answer is neither A nor B but rather C == _must_ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Wed Mar 17 17:13:32 2010 From: jimad at msn.com (James Adcock) Date: Wed, 17 Mar 2010 17:13:32 -0700 Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: References: Message-ID: >i do it with tools i've programmed that do all the things that you talk about, and more. that's how i do it, keith. Post your tools, including source code, and then let's talk about it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From danweber at mindspring.com Wed Mar 17 17:23:09 2010 From: danweber at mindspring.com (Dan Weber) Date: Wed, 17 Mar 2010 20:23:09 -0400 Subject: [gutvol-d] Popular Science back issues Message-ID: <003901cac631$30b64830$9222d890$@com> Sorry. The address is http://www.popsci.com/announcements/article/2010-03/new-browse-137-years-pop sci-archive-free I had to grab the jpg files from my temp internet files folder (Win Vista) Hopefully someone else will know a better way to get them They say they are partnered with Google danweber at mindspring.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Mar 17 17:53:34 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 17 Mar 2010 20:53:34 EDT Subject: [gutvol-d] Re: New Tool "pgdiff" Message-ID: <7a50f.2606ca15.38d2d38e@aol.com> jim said: > Post your tools, including source code, and then let?s talk about it. maybe you're too new here to know this, but i don't post my source code. even if i did, it wouldn't do you any good, unless you can use realbasic... likewise, your code doesn't do me any good, because i don't deal with whatever language you've posted it in, so you have done me no favors, and thus i don't feel any need to "reciprocate" for your posted code... but what _might_ do you some good is for me to talk in pseudo-code, if you're interested in hearing that, which i am more than happy to do. but none of this is difficult. especially from a line-based perspective. you read one file into one array, and the second file into a second array, and then compare the two arrays, item-by-item. it ain't rocket-science. the main difficulty in any comparison routine is the re-sync process; but if you work in a line-based way, your lines don't get out-of-sync. (in the rare cases where they do, you can make a manual adjustment.) so with this, the interface is more important than the underlying code. but i'm even willing to post compiled versions of a comparison tool, so that you can get a very good idea about the interface i am using, provided you can get a handful of people -- i.e., 5 people -- to say publicly, right on this listserve, that they would like to see my tool... but i haven't gotten the impression that anyone here can code a g.u.i. for an offline app. i'd love to be wrong about that, so please please do correct me, anyone listening out there, if you can indeed do that... -bowerbird p.s. if you can't get 5 people to say "please", then you should read the design description that keith wrote up last night, as it's decent. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Wed Mar 17 17:56:38 2010 From: jimad at msn.com (James Adcock) Date: Wed, 17 Mar 2010 17:56:38 -0700 Subject: [gutvol-d] Re: any arguments against "free-range" proofing? In-Reply-To: <4B9FFD1A.8000409@verizon.net> References: <36c3a.27b91b5.38c98aa2@aol.com> <4B9FFD1A.8000409@verizon.net> Message-ID: >With the current systems, a volunteer knows that even if they can't do the entire book themselves, someone else will help out and it will get done. This statement is not true, but also to the extent it is true is also be a statement of a problem: DP has many examples of books that volunteer(s) start but which don't get finished. Hence the queuing system and the increasing wait times. However, your thesis is also a statement of a problem: When volunteers start something they assume that *someone else* needs to finish it! In turn these other volunteers may feel an obligation to finish something that someone else has started when a better answer may be to NOT finish it! Certainly in the case of very difficult and time-consuming books that no one wants to read, the right answer may be to NOT finish it. One can easily show other cases that are much more interesting: difficult books that people WOULD want to read if they were finished and yet the right answer might STILL be that it is better off NOT to finish it! [see for example: Bibliotheca Britannica] When I volunteer at DP I often end up asking myself a simple question: Do *I* think that if the person who started this project had to do it all themselves would they do so? If the answer is "NO" then I decide that my efforts are being "freeloaded" upon and I go work on something else! Conversely, one of my proposals for changes at DP is a simple one: if person A starts a book and other volunteers do not want to finish it then at least let person A finish it rather than leaving it stuck on queue "forever"! One simple measure of the "worthiness" of a project is that at least one person in the world wants to finish it. Unfortunately, DP fails even that test! - the current system doesn't even allow a person who *wants* to finish a book the right to do so! At least put in a "time out" system or something where if something gets stuck for a year or more then DP admits they are not going to get it done in a timely manner and put it back up for grabs! >I guess what I'm saying is that people who proof for the sake of proofing like to see progress. To me personally "seeing progress" means seeing something I have worked on posted to PG for others to read. Agreed that means the book needs to get "done." Each spot on a queue for a book to get stuck on is yet another chance for a book to become not-done. >Another issue with a free-range system has to do with abuse. If no one is likely to look again at whatever page I've just done, there is nothing to keep me from changing what it says. Think of it as a kind of graffiti. I have had problems with this on Wikipedia, where one posts science-based answers to science-based questions and then people whose religion or politics conflicts with the science hack the postings. Certainly when someone is proofing something that they find offensive the temptation is always to "edit." -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Wed Mar 17 18:04:10 2010 From: jimad at msn.com (James Adcock) Date: Wed, 17 Mar 2010 18:04:10 -0700 Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: <7a50f.2606ca15.38d2d38e@aol.com> References: <7a50f.2606ca15.38d2d38e@aol.com> Message-ID: >but i haven't gotten the impression that anyone here can code a g.u.i. for an offline app. i'd love to be wrong about that, so please please do correct me, anyone listening out there, if you can indeed do that... I will readily admit to NOT being the world?s greatest GUI writer, but that is not the problem. The problem is not having a portable GUI system that I?d be happy to write tools on that work on the variety of machines PG/DP people work on. If you know of a portable GUI library system that you think is really really good let me know ? everything I dig into ends up disappointing me. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Wed Mar 17 18:10:31 2010 From: dakretz at gmail.com (don kretz) Date: Wed, 17 Mar 2010 18:10:31 -0700 Subject: [gutvol-d] Re: New Tool "pgdiff" In-Reply-To: References: <7a50f.2606ca15.38d2d38e@aol.com> Message-ID: <627d59b81003171810g549fa6f6w1e3611bc30101465@mail.gmail.com> I think realistically the only three durable options are 1.) Adobe Flex with the AIR library - fairly portable to W/M/L and their new release has one of the best text-layout libraries I've seen; But also see my earlier comments. 2.) Silverlight - with the obvious microsoft attributes, and 3.) HTML 5 which looks like it may be an option sooner rather than later. On Wed, Mar 17, 2010 at 6:04 PM, James Adcock wrote: > *>*but i haven't gotten the impression that anyone here can code a g.u.i. > for an offline app. i'd love to be wrong about that, so please please > do correct me, anyone listening out there, if you can indeed do that... > > I will readily admit to NOT being the world?s greatest GUI writer, but > that is not the problem. The problem is not having a portable GUI system > that I?d be happy to write tools on that work on the variety of machines > PG/DP people work on. If you know of a portable GUI library system that you > think is really really good let me know ? everything I dig into ends up > disappointing me. > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Wed Mar 17 18:53:53 2010 From: prosfilaes at gmail.com (David Starner) Date: Wed, 17 Mar 2010 21:53:53 -0400 Subject: [gutvol-d] Re: any arguments against "free-range" proofing? In-Reply-To: References: <36c3a.27b91b5.38c98aa2@aol.com> <4B9FFD1A.8000409@verizon.net> Message-ID: <6d99d1fd1003171853m3293107ck90c3a4172ee76979@mail.gmail.com> On Wed, Mar 17, 2010 at 8:56 PM, James Adcock wrote: > Certainly in > the case of very difficult and time-consuming books that no one wants to > read, Unless I've missed something, you've never provided an example of such. You've certainly never shown that they exist in significant numbers at DP. -- Kie ekzistas vivo, ekzistas espero. From vlsimpson at gmail.com Wed Mar 17 20:29:46 2010 From: vlsimpson at gmail.com (V. L. Simpson) Date: Wed, 17 Mar 2010 22:29:46 -0500 Subject: [gutvol-d] Re: To whom it may concern: In-Reply-To: <4BA13358.9020207@telkomsa.net> References: <003301cac45c$af100b20$0d302160$@com> <4BA13358.9020207@telkomsa.net> Message-ID: On Wed, Mar 17, 2010 at 2:54 PM, Jon Richfield wrote: > Dan, yes it concerns me, but I cannot find those scans on that page. Should > I be looking more intelligently? > > Cheers, > > Jon > > On 2010/03/15 18:29 PM, Dan Weber wrote: > > To whom it may concern: > www.popsci.com > This site has 137 years of Popular Science magazine page scans online for > free. I typed archive in the search box on the site and got this: http://www.popsci.com/archives Then Google books advance search, title: popular science; check full view and magazine buttons. From Bowerbird at aol.com Wed Mar 17 21:31:51 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 18 Mar 2010 00:31:51 EDT Subject: [gutvol-d] [SPAM] re: Re: New Tool "pgdiff" Message-ID: <7c687.59e88b2e.38d306b7@aol.com> jim said: > If you know of a portable GUI library system > that you think is really really good let me know i use realbasic. if i didn't think it was really good, i wouldn't use it. it compiles to windows, mac (even back to classic), and linux too... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Mar 17 22:05:57 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 18 Mar 2010 01:05:57 EDT Subject: [gutvol-d] [SPAM] re: New Tool "pgdiff" Message-ID: <520ca.55901046.38d30eb5@aol.com> jim said: > And the correct answer is neither A nor B but rather C == _must_ well, one could easily program the tool to offer an italicized version of an all-upper choice if you know you'll be processing p.g. e-texts. indeed, a button that will italicize both choices is easy enough to code. likewise with _any_ particular editing function that might be required... for instance, i've programmed a routine that checks for a spacey-quote; if it finds a spacey-quote in one of the choices, and the two choices are otherwise identical, it auto-selects the option without the spacey-quote. *** in reviewing the pgdiff output from the sitka files, i wanted to see if you would do much preprocessing on the files. it appears to me you did not. in general, i'd _highly_ recommend preprocessing before a comparison. the number of diffs can be significantly lowered by good preprocessing, and preprocessing is typically a far more efficient way to make changes. it also helps to know about the nature of the files that you're comparing. for instance, one of the sitka files was a post-proofing file, meaning that it was littered with artifacts of the d.p. workflow... these include "notes" the proofers leave for the post-processor. it's far better to handle these "notes" in an editor during preprocessing before you start the comparison. another artifact of the d.p. workflow is asterisks on end-line (and end-page) hyphenates. i typically just delete these asterisks, as i have no use for them. some of these were present in the o.c.r. file too, so i removed them as well... after having deleted all the asterisks associated with "notes" and hyphenates, the only asterisks left in the file were those that indicated _footnotes_ in the o.c.r. file, so i did a monitored global change of them to footnote indicators. that way, these footnote indicators wouldn't present a "spurious" difference... (i could've done a global change to the characters that indicated the second and third footnotes on a page, but i didn't bother, as there weren't too many.) it also helps to know that rfrank marks "questionable" situations with an "@", so you can search for those and deal with those before doing a comparison. oh, and one other _big_ thing. the o.c.r. file had the _pagenumbers_ in it. they were enclosed in brackets, at the bottom of most pages, which is why rfrank's preprocessing program probably didn't find them to delete them... now, those pagenumbers were deleted by the proofers -- except in the 2 cases where the proofers failed to make the deletion -- so they were _not_ present in the second file. so, to avoid the spurious diffs, you could have eliminated them from the o.c.r. file easily, with a series of reg-ex changes. on the other hand, since i _like_ pagenumbers, and want to _keep_ them, i had my tool _inject_ them from the o.c.r. file back into the proofed file... either way, it's best to eliminate as many of these "spurious" diffs as you can. and i note here, jim, that you did eliminate one case of such "spurious" diffs when you reformatted the page-scan references so they would be identical. so i encourage you to take that general idea and run with it... i _will_ talk further about the diffs that were generated anyway. but i wanted to stress the importance of doing preprocessing... *** jim, i looked at hkdiff.txt briefly. i don't know what kind of sense to make of this diff at the end: > {|or|don't|you?-that's|the|idea.|Don't|you|reckon|12} i removed the whitespace so it'd fit on one line. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Wed Mar 17 22:08:31 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 18 Mar 2010 01:08:31 EDT Subject: [gutvol-d] @!@!@!@!@!@!@! Re: [SPAM] re: New Tool "pgdiff" Message-ID: <521b3.74a3f49a.38d30f4f@aol.com> why are these posts coming back with "spam" in the header? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From richfield at telkomsa.net Wed Mar 17 23:52:51 2010 From: richfield at telkomsa.net (Jon Richfield) Date: Thu, 18 Mar 2010 08:52:51 +0200 Subject: [gutvol-d] Re: To whom it may concern: In-Reply-To: References: <003301cac45c$af100b20$0d302160$@com> <4BA13358.9020207@telkomsa.net> Message-ID: <4BA1CDC3.6030408@telkomsa.net> Thanks to Dan and V. L. Simpson. This worked. It is of course just a weeny bit data-greedy for routine work, but it is nice to be able to go there. My respects to Pop-Sci for an intelligent and public-spirited use of a valuable resource, and a vigorous site. A lot of other magazines could do worse than inspect their effort with respect. Cheers, Jon > On Wed, Mar 17, 2010 at 2:54 PM, Jon Richfield wrote: > >> Dan, yes it concerns me, but I cannot find those scans on that page. Should >> I be looking more intelligently? >> >> Cheers, >> >> Jon >> >> On 2010/03/15 18:29 PM, Dan Weber wrote: >> >> To whom it may concern: >> www.popsci.com >> > >> This site has 137 years of Popular Science magazine page scans online for >> free. >> > I typed archive in the search box on the site and got this: > http://www.popsci.com/archives > > Then Google books advance search, title: popular science; check full > view and magazine buttons. > > From schultzk at uni-trier.de Thu Mar 18 00:01:33 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Thu, 18 Mar 2010 08:01:33 +0100 Subject: [gutvol-d] Re: [SPAM] RE: Re: [SPAM] re: New Tool "pgdiff" In-Reply-To: References: <6311e.21050e3b.38cffc26@aol.com> Message-ID: Hi All, It is interesting how most here are in love with their tools. I have notice how statistics and proofs are stated to show what makes THIER tools the best. No, James this is not directly aimed at you, but all the others. I could show you all easily that diff is the wrong tool. It is inefficient been proven since the 60s or was that the 70s. But, who cares. Your example can only be handled by an ABLE PROOFER. Neither your tool nor anybodies elses is better. Come on, peolple! get productive. regards Keith. Am 18.03.2010 um 00:59 schrieb James Adcock: > > Attempting to hand-score the kinds of edits one would need to do on hkdiff.txt, it seems to me like an intelligent editor could present ?Choose A? vs. ?Choose B? alternatives about 85% of the time, whereas the other 15% of the time a more complicated interface would have to be presented ? or else the editor just punts and points to the text and says ?You Fix It! (which is basically the approach my current choice of editor takes 100% of the time ;-) > > However, if the editor gives a ?Choose A? vs. ?Choose B? interface sometimes the editor (and/or the user) is going to be deceived because what looks like an A/B choice really ISN?T. For example a hypothetical example: > > ?. one { must | MUST } be careful! > > And the correct answer is neither A nor B but rather C == _must_ > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: From gbnewby at pglaf.org Thu Mar 18 07:24:15 2010 From: gbnewby at pglaf.org (Greg Newby) Date: Thu, 18 Mar 2010 07:24:15 -0700 Subject: [gutvol-d] Re: @!@!@!@!@!@!@! Re: [SPAM] re: New Tool "pgdiff" In-Reply-To: <521b3.74a3f49a.38d30f4f@aol.com> References: <521b3.74a3f49a.38d30f4f@aol.com> Message-ID: <20100318142415.GA2676@pglaf.org> On Thu, Mar 18, 2010 at 01:08:31AM -0400, Bowerbird at aol.com wrote: > > why are these posts coming back with "spam" in the header? > > -bowerbird The pglaf mailer includes a few spam filters, one of which adds the [SPAM] string to the subject header. Responders should edit them out, not propagate them. -- Greg From Bowerbird at aol.com Thu Mar 18 08:29:56 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 18 Mar 2010 11:29:56 EDT Subject: [gutvol-d] i love my tools! because the proof is in the pudding! Message-ID: <7e817.4149442d.38d3a0f4@aol.com> guilty as charged, keith! i _do_ love my tools! all of them! most especially the ones that i coded myself, since i put a lot of blood, sweat, and tears into each one! but even the ones that other people made, i love those too! because they make my life easier for me! and i love "easier"! and it's not that stupid kind of blind love, either. no sir. because -- don't forget it! -- the proof is in the pudding! i can tell you exactly why i love each of my tools, and those reasons are good, solid reasons with sound, logical backing. i analyze my needs, and my tools, extremely closely so that i _know_ -- with _certainty_ -- just exactly what i need and why i need it, and then i make sure that my tools deliver it, whether they were programmed by someone else (which i strongly prefer, because i _did_ mention that i love "easier") or programmed by me (due to the blood, sweat, and tears). i'm happy to share my analyses of my needs, and my tools, too, because that makes all of us smarter about all of that, and i'm happy when other people share _their_ analyses too. but yes yes yes yes yes yes yes, i _do_ love my tools! i do! and it's easy as pie to know why! the proof is in the pudding! -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Thu Mar 18 10:14:39 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 18 Mar 2010 13:14:39 EDT Subject: [gutvol-d] jim, i have some questions about pgdiff output Message-ID: <4e132.3aad1a5d.38d3b97f@aol.com> jim, here are 55 cases where your tool seems to give us more than just the 2 choices that i would expect to see... in some cases, such as the second one listed (hot springs), it's because one of the proofer's notes contained a "|" in it. you'll want to screen the input for your significant characters, i.e., any "{" or "}" or "|", and eliminate them to avoid confusion. i have some other questions as well, but let's see if the fix for this issue provides answers for those questions as well... -bowerbird > { Seattle, | > Seattle, Washington | > Washington##} [Illustration]# > Hot { Springs;[**probable printer's error. Should be ,|next: no, ; is fine] | > Springs; } each with its individual charm;# > connect the Sitka of the past, the { Novo | > Novo Ark-*angelsk | > Arkangelsk#} of the great Russian American Company# > and their fate will never be known to a { certainty.##[Footnote | > cer-taintv.##A: | > * } January { 20th.[**,?] | > 20th. 1820, | > 1820. } a letter written by the Directory at St.# > { ====sitka0-015.txt===###[Illustration: Copyright by E. W. Merrill, | > [10]####Sitka. | > ====sitka0-015.txt===###Mount | > O#Edgecumbe.] | > H####} ====sitka0-016.txt===# > He named the mountain { San | > San Jacinthus, | > Jacinthus,}# > toward the sea, Cape { del | > del Engano. | > Engano. } No one who# > 2,000 skins of the { Morski | > Morski bobrov, | > 'bo'brov, } as they# > or { Kolosh | > Kolosh Ryeku. | > Ryeku.##} On the morning of September 28th the# > { ====sitka0-030.txt===##[Illustration: Sitka in | > [24]####1805--From | > ====sitka0-030.txt===##Lisianski's | > [Blank Voyage.] | > Page]####} ====sitka0-031.txt===# > rose the town of New Archangel { (Novo | > (Novo#Arkangelsk,) | > Arkangelsk,) } and on the kekoor was built a# > valued at 450,000 { rubles.[B]##[Footnote | > rubles.f##A: | > * } The livestock taken to Sitka in 1804 consisted of "Four# > p. { 218.)]##[Footnote | > 218.)##B: | > t } Lisianski made the surveys and named the islands of the# > { yourts, | > yourts, } in which live the { kayours | > kayours } and the# > { [Footnote A: | > * } The Russian sazhen is { 7 | > I feet.] | > feet,# > { [Footnote A: | > * } These books and letters were brought by Resanof { in | > In } the# > theft in the years when there was no custodian of such { property.]###====sitka0-043.txt===###[Illustration: The Bakery and Shops of the Russians--Later the | > property.##Sitka | > [36]####Trading | > ====sitka0-043.txt===##Co.'s | > [Blank Building.] | > Page]####} ====sitka0-044.txt===# > the { dushnoi dereva | > dushnoi or | > dereva scented | > or., at scented } wood of the# > Place of Islands { (Chasti | > (Chasti Ostrova) | > Ostrova) } is reputed# > { [Footnote | > 103.##A: | > * } Golofnin, Voyage of the Sloop "Kamchatka," in Mat. { Pt. | > Ft. } 4, p.# > { Wrangel's | > Wraiigel's } daughter--Mary." There { is | > Is } also { to | > t-> } be found: "Died,# > the church by a partition called the { Ikonastas, | > Ikonastas,#} which is ornamented with twelve { ikons, | > ikons, } or# > { repousse | > repousse } work in the true { Russian | > Eussian } style of# > { ====sitka0-066.txt=== | > [56]####[Illustration: | > ====sitka0-066.txt===###} The { Madonna.] | > Madonna.# > { ====sitka0-071.txt===###[Illustration: | > [60]####/* | > ====sitka0-071.txt===###} The Baranof Castle.# > The { U. | > IT. S[**.] | > S } Agricultural Department { building occupies | > building-occupies } the site at the# > { [Footnote A: | > * Narative[**Narrative?] | > Narative } of a Voyage Round the { World, | > World. } 1836-1842, by Captain# > Sir Edward Belcher, Vol. { 1,[**I?] | > I, } pages { 95 | > 05 } et { sen.# > { ober off | > ober offitzer | > User } who sought her hand in marriage.# > dead in one of the small drawing { rooms."[A]##[Footnote | > rooms."*##A: | > * } Frederick { Schwatka, | > Sohwatka, } the explorer, seems to have been one of# > 24th, 1896, and the time is fixed as being in the administration { of]* | > of##====sitka0-074.txt=== | > [62]####[Illustration: | > ====sitka0-074.txt===###} The Grave of the Princess { Maksoutoff.] | > Maisoutoff.# > martin from the Yukon, others { en | > en route | > route } to# > reason for their living on this distant { shore.[A]##[Footnote | > shore.*##A: | > * } Between 1821 and 1862 there were shipped by the Russian# > (Washington, Government Printing { Office).]###====sitka0-079.txt===##[Illustration: Sitka in 1860, Near the Close | > Office).##of | > [66]####the | > ====sitka0-079.txt===###Russian | > CD#Administration.] | > CO####} ====sitka0-080.txt===# > for calico and beads, blankets and { ammunition.[A] | > ammunition.*#} This market was closed by a { portcullised | > portcul-lised# > quids; fish priced according to { size[** | > size ;?] | > ? } all according to price list established# > B: | > t } Golobokoe Lake was sounded to a depth { of | > cf } 190 fathoms# > Ivan { Vasilivich | > Vasiiivich Furuhelm, | > Funihelm, } June 22, 1859, to Dec. 2, 1863.# > { ====sitka0-090.txt===###[Illustration: Sitka in 1869--During the Time of the Military | > [76]####Occupation.] | > ====sitka0-090.txt===#####} ====sitka0-091.txt===# > in the land that had so long been their { home.[C] | > home.t#} Among those who remained are the { Kashavaroffs, | > Kashavar-offs,# > { [Footnote A: | > * } The Russian soldiery were dressed { in | > In } a dark uniform, trimmed# > it down on the bayonets of the Russian { soldiery.]##[Footnote | > soldiery.##C: | > t } On December 14, { 1807, | > 1807. } the Russian ship "Czaritza," sailed for# > Russia, via London, with { 168 | > 368 } passengers. January { 1, | > I, } 1868, the# > Ex. Doc. H. R. 41st Cong. 2nd { Ses., | > Ses.. } p. 1030; Seattle { Intelligencer, | > intelligencer,# > 123; citizens by treaty, 229. { Total, | > Total,.444. 444. | > ? } Beardslee's Report, 47th# > Erussard, Ed. Doyle, George E. Pilz, Nicholas Haley, John { McKenna, | > MCKenna,#Reub | > Keub } Albertson, John Olds and { others.# > One of the traders of the town, { Caplin, said: | > Caplin,-said:#} "De Captain may go { to ---- | > to---wid wid | > his his | > tarn# > { ====sitka0-107.txt===##[**The CP failed to rotate this page correctly.][**Seems to be fixed now :-)]#[Illustration: Sitka--East on Lincoln Street--the Governor's Walk | > [92]####of | > ====sitka0-107.txt===##the | > [Blank Russians.] | > Page]####} ====sitka0-108.txt===# > { ====sitka0-110.txt===##[Illustration: Interior of Cathedral | > [94]####of | > ====sitka0-110.txt===##St. | > [Blank Michael] | > Page]####} ====sitka0-111.txt===# > { [Footnote A: | > ? } The first church in Alaska was built at { Kodiak | > Kodlak } (Paulovski) in# > towers above the bay to the height of { 3,216 | > 3,21.6#} feet. Along the river, known as the { Kolosh | > Kolosh# > is prominent the Devil's Club { (panax | > (panax horridus), | > horrid-us},# > { ====sitka0-117.txt=== | > [100]####[Illustration: | > ====sitka0-117.txt===###} Russian { Blockhouse.] | > Blockhouse.# > drew their stores of { krasnia | > Jcrasnia ruiba | > ruiba } (the red# > the trough of the watering place of { the | > the" "Jamestown," | > Jamestown,"#} came to the beach. This place may be# -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Thu Mar 18 12:30:32 2010 From: jimad at msn.com (Jim Adcock) Date: Thu, 18 Mar 2010 12:30:32 -0700 Subject: [gutvol-d] Re: any arguments against "free-range" proofing? In-Reply-To: <6d99d1fd1003171853m3293107ck90c3a4172ee76979@mail.gmail.com> References: <36c3a.27b91b5.38c98aa2@aol.com> <4B9FFD1A.8000409@verizon.net> <6d99d1fd1003171853m3293107ck90c3a4172ee76979@mail.gmail.com> Message-ID: >Unless I've missed something, you've never provided an example of such. You've certainly never shown that they exist in significant numbers at DP. Unless I've missed something, PG doesn't publish download numbers on anything other than the most popular books. However, TIA does publish download numbers which one can use as proxy: 2,583,382 Downloads of the Most Popular PG Book 8 Downloads of the Least Popular PG Book Bang-for-the-Effort Ratio of Over 300,000 to 1. You can query this yourself using the TIA "Advanced Search" option on "collection:gutenberg" fields to return = downloads + title HTML table Sort Results by: either downloads desc or downloads acs But one should be forewarned that it does not appear to me that patterns of downloads from TIA is identical to pattern of downloads directly from PG -- TIA users are more sophisticated users aka nerdy than PG direct users. Personally I would rather work on a book that is towards the 2,500,000 download end of the spectrum than on the 10 downloads end of the spectrum! Again, there are literally about 1,000 more books out there that can be saved than we have the time and effort to save. The question then becomes, which books do we save? If one is doing the entire job oneself then the answer is easy: That book which you are willing to work on. If one is picking a book and imposing the work on other volunteers then the question becomes who should have the right to make that decision and how? "First come first serve" I suggest is a horrible way to make this choice because it encourages the most greedy and inconsiderate submitters to get there first rather than to take a thoughtful approach to picking which books to save and then doing a really really good job of digitizing and OCR'ing them. From Bowerbird at aol.com Thu Mar 18 12:44:55 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 18 Mar 2010 15:44:55 EDT Subject: [gutvol-d] Re: =?iso-8859-1?q?=40!=40!=40!=40!=40!=40!=40!_Re=3A=A0_=5BSPAM?= =?iso-8859-1?q?=5D_re=3A_New_Tool_=22pgdiff=22?= Message-ID: <9eacc.665e7552.38d3dcb7@aol.com> greg said: > The pglaf mailer includes a few spam filters, one > of which adds the [SPAM] string to the subject header. why? and for what purpose? which particular "filter" is it that is doing this? can it be deactivated? if not, can you tell us how to avoid setting it off? because this "filter" is doing nothing but emitting false alarms. and it's not stopping any spam, is it? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Thu Mar 18 15:16:42 2010 From: jimad at msn.com (James Adcock) Date: Thu, 18 Mar 2010 15:16:42 -0700 Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output Message-ID: >jim, here are 55 cases where your tool seems to give us more than just the 2 choices that i would expect to see... See my discussion below of what the Levenshtein Distance is and how pgdiff implements it. >in some cases, such as the second one listed (hot springs), it's because one of the proofer's notes contained a "|" in it. you'll want to screen the input for your significant characters, i.e., any "{" or "}" or "|", and eliminate them to avoid confusion. Agreed that this would be a problem if my tool is used as input to another "smart editor" tool that wants to present "Choose A" vs. "Choose B" type choices. Since instead the tool was targeting a regex editor being driven by a real human being who can recognize from context whether the "{|}" chars are being used to highlight differences vs. being used as part of the input text it hasn't been a problem for me re the intended problem domain. ====== Levenshtein Distance is the measure of the number of changes needed to transform one string of tokens into a different string of tokens, where the allowable edits are "insert", "delete" or "substitute." Different implementations of the algorithm would have different interpretation of what constitutes a "token" and what constitutes a "string". One obvious interpretation would be that a "token" is an ascii char and a string is a line of text (dictionary lookups of miss-spelled words) Another obvious interpretation is a "token" is a line of text and the string is the list of lines of text within a file (diff) pgdiff implements neither of these but rather a "token" to be a "word" where a "word" is a non-white sequence of chars followed by a white sequence of chars, where the white sequence of chars is considered not-significant for the purposes of the Levenshtein Distance, but IS significant for the display of output. pgdiff considers the "string" to be the entire list of words in the input file. The typical importance of the white part is whether words are separated by a space or by a linebreak. Pgdiff doesn't care about the white part in terms of the Levenshtein Distance, so that the two input files can have different line lengths and different linebreak locations, and still be comparable. This also means that typically including page break information in the input files such as the "====== filename.101 ====" type stuff would NOT be a good idea, since typically the input files may have their page breaks in different locations re their word content -- unless the two input files are from the same identical edition. So here's some answers to some implied questions or assumptions: Does pgdiff look for word differences within a line of text? No. Does pgdiff look for single word changes? No. OK, what does pgdiff do? What pgdiff does is to calculate a best match of words across two entire files. Assuming you set the input options large enough, for example, one input file could contain an entire chapter that the other input file doesn't contain and the algorithm would sync up just fine. Or in the case of a book I've worked on previously the US version had paragraphs removed by a censor, whereas the European version of the text had them intact. When the words do not match exactly, the mismatches are categorized three ways 1) Insert this missing word. 2) Delete this extraneous word. Or 3) Substitute this one word for a different word. Now by reversing the input order options 1) and 2) obviously become symmetrical -- an insertion in one case becomes a deletion in the other case. So in either case an isolated word difference is displayed like { this } or if a bunch of words in a row are delete or insert like { this is in one text but not the other } In case 3) if only one word is different in a row it displays the output choice like { this | that } But in case three if a bunch of words are different in a row how to display them? If the differences are due to scannos it is probably best to display the words next to each other { this | th*s is | iz a | u test | tost } whereas if the differences are due to human editing it would probably be best to display them as "sentences" { THIS IS A TEST | _this is a test_ } If you are implementing a "smart editor" then clearly you can choose to display them which way you want. In practice what one normally sees is some weird mixture of the two possible situations, and it isn't clear to me which display technique is best, so so far I have chosen the easiest approach to implement -- which is the first pattern of display { this | th*s is | iz a | u test | tost } >From the BBoutput.txt file, for example, consider: { Seattle, | Seattle, Washington | Washington } Which is of the first pattern. The ending } is on a newline since the two tokens differing in whitespace, space vs. linebreak. Taking that diff back out one gets: { Seattle, | Seattle, Washington | Washington } Which one can read as: Choose one of: Seattle, OR Seattle, Followed by: Choose one of: Washington OR Washington In this case if one KNEW the differences are due to humans rather than scannos , then it is "obvious" that the better display pattern would be the second one: { Seattle, Washington | Seattle, Washington } IE Choose one of: "Seattle, Washington" OR "Seattle, Washington" But in general the tool doesn't know if differences are due to human edits or scannos, and in general what one sees is a mixture of both problems happening at the same time. PS: OK pgdiff doesn't REALLY match across ENTIRE files since if the files are huge Levenshtein is an n^2 algorithm in space and time. What it does do is break a file into large overlapping chunks of text and calculate the measure across the chunks, where the size of the chunks can be specified as in input parm if you prefer, and where the chunks get sewn back together using an invariant of choosing places in the match where words DO match, and checking the sanity of that match to make sure we haven't lost sync. What this means in practice is that if you specify a parm of -10000 as an input setting then the algorithm can "ONLY" handle about 10000 word mismatches adjacent to each other in a row without erroring out. This parm in practice is important for versioning where two editions of a book have large chunks of text which don't match each other. IE a chapter is edited out or edited in or a censor has taken their knife to the text. Common problems are that two texts from different editions have entire book prefixes (introductions) or entire book suffixes (postscripts or indexes) which don't match -- which one is better to explicitly remove and deal with separately, but which the algorithm will try to handle if you set the input parm large enough. From Bowerbird at aol.com Thu Mar 18 15:22:31 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 18 Mar 2010 18:22:31 EDT Subject: [gutvol-d] i'm a busy little bee Message-ID: <8a6e1.155e416c.38d401a7@aol.com> i'm a busy little bee these days... let's see... *** i'm waiting on a response from jim so i can continue to analyze his tool, and work on my support-application... *** i grabbed the "frankenstein" content from lee passey's site, and have mounted my own version of the book over here: > http://z-m-l.com/go/fpass/dofpass.pl the perl script lets you step through the pages of the book. the script shows lee's .html files as they were when grabbed. i don't wanna mess with the complications of an .html editor, so i won't be bothering to offer an edit capability on the text, at least not quite yet... i mention this because editing is one of the thorny aspects of using .html as your saved-text format. lee is using the "kupu" html-editor. too much trouble for me, and likely far too many cross-browser inconsistencies as well. but those are lee's problems to solve, not mine. good luck, lee. *** here's a little script that tells you if a word you enter is present in the dictionary that i use. i'm not sure why you'd want to know such information, but the script was developed in support of my spiffy spellchecking feature, so it's there if you _do_ have a need: > http://z-m-l.com/go/dict17577.pl *** i'm finishing up a long post on intelligent filenaming. yes, again. but there's a good angle, important enough to warrant exposition. the angle, to steal my own thunder, is that you can easily use the pagenumber information that's contained in the o.c.r. files to rename your .txt and .png files in a more-intelligent manner. if the pagenumber in file "011.txt" says that it's pagenumber 7, you'll rename "011.txt" and "011.png" to "007.txt" and "007.png". easy as pie! *** and finally, i'm still coding my online proofing system. it's grand. i love my tools... anyway, i'll probably unveil the thing next week. yeah, yeah, i know, you can hardly wait, you're so excited... ;+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Thu Mar 18 15:50:25 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 18 Mar 2010 18:50:25 EDT Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output Message-ID: <8c234.435b34c3.38d40831@aol.com> jim said: > In practice what one normally sees is some weird mixture of > the two possible situations, and it isn't clear to me > which display technique is best, so so far I have > chosen the easiest approach to implement in view of the frank admission, let me make some suggestions. i believe these would make your tool's output more workable for the end-user who has to resolve the diffs, no matter _what_ method they use, including the reg-ex editor you use yourself. *** let me pull out the last 2 of the 55 anomalies i posted for you... *** here's the first: > drew their stores of { krasnia | > Jcrasnia ruiba | > ruiba } (the red some people might prefer the version as it was in your file: > stores of { krasnia | Jcrasnia ruiba | ruiba } (the red rather than showing this as a single diff, i'd present it as two... the first would be: > { krasnia | Jcrasnia }. the second would be > { ruiba | ruiba } *** here's the second example: > the trough of the watering place of { the | > the" "Jamestown," | > Jamestown,"} came to the beach. This place may be or, more in keeping with how it's displayed in your output: > place of { the | the" "Jamestown," | Jamestown,"} came to again, i would present this as two diffs... the first would be: > { the | the" } the second would be: > { "Jamestown," | Jamestown,"} *** in both of these examples, i think combining the 2 diffs into one bracket-bound item confuses the item unnecessarily, and confuses the end-user in the process, making the resolution much more difficult than it needs to be... in many of these "multiple diff" brackets, i could have my tool pull apart the various diffs, and display them appropriately... so, you know, if you think the output you are showing now is done the way you _want_ to have it done, that's your decision. but i think it will be more clear if you did it slightly differently. *** another confusion i had with your output was that there were several bracketed items that contained some separator-lines... since all of those separator-lines were standaridized by you before you ran your pgdiff, it seems to me that none of them should've been included in any of the brackets. should they? if you start bringing non-diff material into the edit process, you're asking for problems, it would seem to me, so i would rework that code to try to avoid such problems if i were you. anyway, just a few suggestions, hopefully helpful ones... :+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From vze3rknp at verizon.net Thu Mar 18 17:04:01 2010 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Thu, 18 Mar 2010 20:04:01 -0400 Subject: [gutvol-d] Re: i'm a busy little bee In-Reply-To: <8a6e1.155e416c.38d401a7@aol.com> References: <8a6e1.155e416c.38d401a7@aol.com> Message-ID: <4BA2BF71.7060101@verizon.net> On 3/18/2010 6:22 PM, Bowerbird at aol.com wrote: > the angle, to steal my own thunder, is that you can easily use > the pagenumber information that's contained in the o.c.r. files > to rename your .txt and .png files in a more-intelligent manner. > > if the pagenumber in file "011.txt" says that it's pagenumber 7, > you'll rename "011.txt" and "011.png" to "007.txt" and "007.png". > > easy as pie! Only as easy as pie if the OCR got the page number correct. In my experience, page numbers are very frequently misread. JulietS -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Thu Mar 18 18:00:29 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 18 Mar 2010 21:00:29 EDT Subject: [gutvol-d] Re: i'm a busy little bee Message-ID: <7a7aa.5a7091d9.38d426ad@aol.com> juliet said: > Only as easy as pie if the OCR got the page number correct. > In my experience, page numbers are very frequently misread. i'm getting ahead of myself, but yes, you must check them first. in the sample book which i'll be talking about, 4 pagenumbers were misrecognized in 108 pages, and 1 was entirely missing. and actually, the misrecognitions were on the left-bracket that preceded the pagenumber, rather than the pagenumber per se. but yes, it is true this book had atypically accurate recognition, and that pagenumbers are not infrequently misrecognized... however, it's also true there is a _huge_ amount of redundancy in pagenumbers -- they march on in a predictable sequence -- so routines can be written (and i have written a few of them) to "fill in" the missing numbers in an astonishingly accurate way... even the unnumbered pages stick out in a fairly distinctive way, since the numbering-sequence politely "steps around" them... but why don't we hold off further dialog until i do my full post? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Thu Mar 18 18:46:31 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 18 Mar 2010 21:46:31 EDT Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output Message-ID: <7d330.675f41a8.38d43177@aol.com> for the sake of comparison, here's how i display the sitka diffs: > http://z-m-l.com/go/jimad/sitka-175diffs.html -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Thu Mar 18 20:08:13 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 18 Mar 2010 23:08:13 EDT Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output Message-ID: <818ff.2c290d3d.38d4449d@aol.com> and jim, it doesn't look like 5 people want to see my comparison tool, so you'll have to settle for a screenshot (with some fancy stuff deleted): > http://z-m-l.com/go/jimad/comparison-screenshot.png -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Thu Mar 18 21:01:50 2010 From: prosfilaes at gmail.com (David Starner) Date: Fri, 19 Mar 2010 00:01:50 -0400 Subject: [gutvol-d] Re: any arguments against "free-range" proofing? In-Reply-To: References: <36c3a.27b91b5.38c98aa2@aol.com> <4B9FFD1A.8000409@verizon.net> <6d99d1fd1003171853m3293107ck90c3a4172ee76979@mail.gmail.com> Message-ID: <6d99d1fd1003182101l5883e092wf3b8e5584f818862@mail.gmail.com> On Thu, Mar 18, 2010 at 3:30 PM, Jim Adcock wrote: > Personally I would rather work on a book that is towards the 2,500,000 > download end of the spectrum than on the 10 downloads end of the spectrum! Not something I really see from what you've uploaded to PG, but okay. I'm not sure I agree though; getting something unique online or something higher-quality then can be found elsewhere, is more important to me then something there's a dozen copies of on the web. >?"First > come first serve" I suggest is a horrible way to make this choice because it > encourages the most greedy and inconsiderate submitters to get there first > rather than to take a thoughtful approach to picking which books to save and > then doing a really really good job of digitizing and OCR'ing them. I'm sure we could have told all the Slashdotters to hold on while we were preparing material for them. We might have actually done 40 or 50 books by now that way. I'm sure it also would have helped to criticize our submitters as "greedy and inconsiderate". I'm sure most people who scanned books for DP never thought about the value of the book they were scanning. -- Kie ekzistas vivo, ekzistas espero. From schultzk at uni-trier.de Fri Mar 19 01:54:46 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Fri, 19 Mar 2010 09:54:46 +0100 Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output In-Reply-To: References: Message-ID: <72F0393F-49CB-428D-9E64-6E752997D720@uni-trier.de> Hi, The more i think about the tools discussed here and the use of "diff"s I get the feeling that the use of diff is actually overkill. diff is basically n^2. It was developed when text/string processing was not efficient. Designed for revisioning and compression. It works best for frequent and large differences. Furthermore, it aides in analysis. Proofing is per se linear, has relatively few differences, and is aided by humans, and a new version is to be created and not a merge. The process is simple compare text A and B as long as they are equal and then gather the information as long as the differ, present the difference, offer possible changes, continue. Without much analysis one can see that this process is linear. So maybe a more direct approach could be viable. Of course, other problems of the collaboration have to dealt with elsewhere. O.K. this approach may seem simplistic and primitive, yet it solves a few problems. 1) equality and proofing are done in one pass 2) works with files of any size 3) works with text divided among several files 4) can be easily integrated into different editor modals 5) presentation of the two versions is part of the tool and not dependent on other EXTERNAL representations 6) the processing of metadata and formatting is controlled by the proofing/editor tool. No more worry about pollution for the external diff-tool Cavets: a) you would need a logging system for changes b) higher storage requirements for the entire system c) would have to be programmed from start d) highly adjustable. regards Keith. Am 18.03.2010 um 23:16 schrieb James Adcock: >> jim, here are 55 cases where your tool seems to give us more than just the > 2 choices that i would expect to see... > > See my discussion below of what the Levenshtein Distance is and how pgdiff > implements it. > >> in some cases, such as the second one listed (hot springs), it's because > one of the proofer's notes contained a "|" in it. > you'll want to screen the input for your significant characters, i.e., any > "{" or "}" or "|", and eliminate them to avoid confusion. > > Agreed that this would be a problem if my tool is used as input to another > "smart editor" tool that wants to present "Choose A" vs. "Choose B" type > choices. > > Since instead the tool was targeting a regex editor being driven by a real > human being who can recognize from context whether the "{|}" chars are being > used to highlight differences vs. being used as part of the input text it > hasn't been a problem for me re the intended problem domain. > > ====== > > Levenshtein Distance is the measure of the number of changes needed to > transform one string of tokens into a different string of tokens, where the > allowable edits are "insert", "delete" or "substitute." Different > implementations of the algorithm would have different interpretation of what > constitutes a "token" and what constitutes a "string". One obvious > interpretation would be that a "token" is an ascii char and a string is a > line of text (dictionary lookups of miss-spelled words) Another obvious > interpretation is a "token" is a line of text and the string is the list of > lines of text within a file (diff) pgdiff implements neither of these but > rather a "token" to be a "word" where a "word" is a non-white sequence of > chars followed by a white sequence of chars, where the white sequence of > chars is considered not-significant for the purposes of the Levenshtein > Distance, but IS significant for the display of output. pgdiff considers > the "string" to be the entire list of words in the input file. The typical > importance of the white part is whether words are separated by a space or by > a linebreak. Pgdiff doesn't care about the white part in terms of the > Levenshtein Distance, so that the two input files can have different line > lengths and different linebreak locations, and still be comparable. This > also means that typically including page break information in the input > files such as the "====== filename.101 ====" type stuff would NOT be a good > idea, since typically the input files may have their page breaks in > different locations re their word content -- unless the two input files are > from the same identical edition. > > So here's some answers to some implied questions or assumptions: > > Does pgdiff look for word differences within a line of text? No. > > Does pgdiff look for single word changes? No. > > OK, what does pgdiff do? > > What pgdiff does is to calculate a best match of words across two entire > files. Assuming you set the input options large enough, for example, one > input file could contain an entire chapter that the other input file doesn't > contain and the algorithm would sync up just fine. Or in the case of a book > I've worked on previously the US version had paragraphs removed by a censor, > whereas the European version of the text had them intact. When the words do > not match exactly, the mismatches are categorized three ways 1) Insert this > missing word. 2) Delete this extraneous word. Or 3) Substitute this one word > for a different word. Now by reversing the input order options 1) and 2) > obviously become symmetrical -- an insertion in one case becomes a deletion > in the other case. So in either case an isolated word difference is > displayed like { this } or if a bunch of words in a row are delete or insert > like { this is in one text but not the other } In case 3) if only one word > is different in a row it displays the output choice like { this | that } But > in case three if a bunch of words are different in a row how to display > them? If the differences are due to scannos it is probably best to display > the words next to each other { this | th*s is | iz a | u test | tost } > whereas if the differences are due to human editing it would probably be > best to display them as "sentences" { THIS IS A TEST | _this is a test_ } > If you are implementing a "smart editor" then clearly you can choose to > display them which way you want. In practice what one normally sees is some > weird mixture of the two possible situations, and it isn't clear to me which > display technique is best, so so far I have chosen the easiest approach to > implement -- which is the first pattern of display { this | th*s is | iz a | > u test | tost } > >> From the BBoutput.txt file, for example, consider: > > { Seattle, | Seattle, Washington | Washington > > } > > Which is of the first pattern. The ending } is on a newline since the two > tokens differing in whitespace, space vs. linebreak. Taking that diff back > out one gets: > > { Seattle, | Seattle, Washington | Washington } > > Which one can read as: > > Choose one of: Seattle, OR Seattle, > > Followed by: > > Choose one of: Washington OR Washington > > In this case if one KNEW the differences are due to humans rather than > scannos , then it is "obvious" that the better display pattern would be the > second one: > > { Seattle, Washington | Seattle, Washington } > > IE Choose one of: "Seattle, Washington" OR "Seattle, Washington" > > But in general the tool doesn't know if differences are due to human edits > or scannos, and in general what one sees is a mixture of both problems > happening at the same time. > > PS: OK pgdiff doesn't REALLY match across ENTIRE files since if the files > are huge Levenshtein is an n^2 algorithm in space and time. What it does do > is break a file into large overlapping chunks of text and calculate the > measure across the chunks, where the size of the chunks can be specified as > in input parm if you prefer, and where the chunks get sewn back together > using an invariant of choosing places in the match where words DO match, and > checking the sanity of that match to make sure we haven't lost sync. What > this means in practice is that if you specify a parm of -10000 as an input > setting then the algorithm can "ONLY" handle about 10000 word mismatches > adjacent to each other in a row without erroring out. This parm in practice > is important for versioning where two editions of a book have large chunks > of text which don't match each other. IE a chapter is edited out or > edited in or a censor has taken their knife to the text. Common problems are > that two texts from different editions have entire book prefixes > (introductions) or entire book suffixes (postscripts or indexes) which don't > match -- which one is better to explicitly remove and deal with separately, > but which the algorithm will try to handle if you set the input parm large > enough. > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d From Bowerbird at aol.com Fri Mar 19 05:57:32 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 19 Mar 2010 08:57:32 EDT Subject: [gutvol-d] save those pagenumber references Message-ID: <6840.ffe3dd8.38d4cebc@aol.com> ok, on the "good news" front, it appears that rfrank has finally decided to start naming his files more wisely, so big respect to the people who steered in that direction. there seemed to be some uncertainty from roger about how to go about coding apps with those new filenames, so i'll talk a little bit about that and hope it filters back... but the initial info can be used by other people as well! sure, if you're scanning your own books, you can name the files intelligently from the get-go, and never worry. (but, um, if you _are_ scanning your own books, please ask me for advice on filenaming, and don't just do what d.p. did when they tried to implement smart filenames, because they got some of the "details" badly mangled.) but sometimes, from other people, you might get files which were named badly, and you'll have to rename 'em. even some of the big scanning projects -- umichigan and the internet archive and google (well, not so much google, not any more, they wised up pretty quickly) -- have been known to adopt some fairly stupid filenaming conventions, so if you use their stuff, you'll have to clean up their mess. so it behooves you to know how. first things first: get yourself "twisted", the dkretz program. > http://code.google.com/p/dp50/downloads/list the initial impetus for this program was precisely this task of renaming files intelligently, and it works very well for it. so that's really all you need. but i'll tell you a bit more... let's say you're doing preprocessing. one of the things that d.p. does is it strips the pagenumbers out of the .txt files... that is just asinine! do not do that, folks. that is the info that you _need_, so -- obviously -- do _not_ throw it away! rfrank discards the pagenumber info from his .txt files too. sometimes, though, for some books, the pagenumber info sidesteps deletion. one such book was the "sitka" one that jim and i have been working on. you can find the file here: > http://z-m-l.com/go/jimad/sitka0-ocr.txt you can see, at the bottom of each page, the pagenumber, enclosed in brackets. and oh what a lovely sight they are! because they tell exactly what the file _should_ be named! for instance, go down to the start of chapter 1. you will see that it occurs in the file rfrank named "011.txt". but, as shown by the pagenumber at the bottom, it's page 7, and _should_ be named "007.txt" or (better) "sitkap007.txt". (in case you're wondering why chapter 1 starts on page 7, it's because the _foreword_ starts on page 1, and runs to page 5. page 6 is a blank verso that is opposite chapter 1.) so we know the file "011.txt" should be "sitkap007.txt". great! but remember the another wrinkle too -- the pagescan filename. so if we know that "011.txt" should be named "sitkap007.txt", we also know that "011.png" should be named "sitkap011.png". now we're cooking... *** so, to find out the pagenumbers in each of the text-files, you can run a little perl program i've put up on the site: > http://z-m-l.com/go/jimad/doglobal.pl that program is a simple "find" program that pulls out any line with the string ".txt" in it, or a right-bracket (i.e., "]"), as shown: sitka0-ocr-001.txt -- [Illustration][**fine print verified by CP] sitka0-ocr-002.txt -- sitka0-ocr-003.txt -- sitka0-ocr-004.txt -- [Illustration: Lovers' Lane, Sitka.] sitka0-ocr-005.txt -- sitka0-ocr-006.txt -- sitka0-ocr-007.txt -- [3] sitka0-ocr-008.txt -- [4] sitka0-ocr-009.txt -- [5] sitka0-ocr-010.txt -- [Blank Page] sitka0-ocr-011.txt -- [7] sitka0-ocr-012.txt -- [8] sitka0-ocr-013.txt -- [9] sitka0-ocr-014.txt -- [10] sitka0-ocr-015.txt -- sitka0-ocr-016.txt -- [11] sitka0-ocr-017.txt -- 112] sitka0-ocr-018.txt -- [13] sitka0-ocr-019.txt -- [14] sitka0-ocr-020.txt -- [15] sitka0-ocr-021.txt -- [16] sitka0-ocr-022.txt -- [17] sitka0-ocr-023.txt -- [18] sitka0-ocr-024.txt -- [19] sitka0-ocr-025.txt -- 120] sitka0-ocr-026.txt -- [21] sitka0-ocr-027.txt -- [22] sitka0-ocr-028.txt -- [23] sitka0-ocr-029.txt -- [24] sitka0-ocr-030.txt -- [Blank Page] sitka0-ocr-031.txt -- [25] sitka0-ocr-032.txt -- [26] sitka0-ocr-033.txt -- [27] sitka0-ocr-034.txt -- [28] sitka0-ocr-035.txt -- [29] sitka0-ocr-036.txt -- [30] sitka0-ocr-037.txt -- [31] sitka0-ocr-038.txt -- [32] sitka0-ocr-039.txt -- [33] sitka0-ocr-040.txt -- [34] sitka0-ocr-041.txt -- [35] sitka0-ocr-042.txt -- [36] sitka0-ocr-043.txt -- [Blank Page] sitka0-ocr-044.txt -- [37] sitka0-ocr-045.txt -- [38] sitka0-ocr-046.txt -- [39] sitka0-ocr-047.txt -- [40] sitka0-ocr-048.txt -- [41] sitka0-ocr-049.txt -- [42] sitka0-ocr-050.txt -- [43] sitka0-ocr-051.txt -- [44] sitka0-ocr-052.txt -- [45] sitka0-ocr-053.txt -- [46] sitka0-ocr-054.txt -- [Blank Page] sitka0-ocr-055.txt -- [47] sitka0-ocr-056.txt -- [48] sitka0-ocr-057.txt -- [49] sitka0-ocr-058.txt -- [50] sitka0-ocr-059.txt -- [51] sitka0-ocr-060.txt -- [52] sitka0-ocr-061.txt -- [53] sitka0-ocr-062.txt -- [54] sitka0-ocr-063.txt -- [Blank Page] sitka0-ocr-064.txt -- [55] sitka0-ocr-065.txt -- [56] sitka0-ocr-066.txt -- sitka0-ocr-067.txt -- [57] sitka0-ocr-068.txt -- sitka0-ocr-069.txt -- [59] sitka0-ocr-070.txt -- [60] sitka0-ocr-071.txt -- sitka0-ocr-072.txt -- [61] sitka0-ocr-073.txt -- [62] sitka0-ocr-074.txt -- sitka0-ocr-075.txt -- [63] sitka0-ocr-076.txt -- [64] sitka0-ocr-077.txt -- [65] sitka0-ocr-078.txt -- [66] sitka0-ocr-079.txt -- sitka0-ocr-080.txt -- [67] sitka0-ocr-081.txt -- [68] sitka0-ocr-082.txt -- [69] sitka0-ocr-083.txt -- [70] sitka0-ocr-084.txt -- [71] sitka0-ocr-085.txt -- [72] sitka0-ocr-086.txt -- [73] sitka0-ocr-087.txt -- [74] sitka0-ocr-088.txt -- [75] sitka0-ocr-089.txt -- [76] sitka0-ocr-090.txt -- sitka0-ocr-091.txt -- [77] sitka0-ocr-092.txt -- [78] sitka0-ocr-093.txt -- [79] sitka0-ocr-094.txt -- [80] sitka0-ocr-095.txt -- [81] sitka0-ocr-096.txt -- [82] sitka0-ocr-097.txt -- [83] sitka0-ocr-098.txt -- 184] sitka0-ocr-099.txt -- [85] sitka0-ocr-100.txt -- [86] sitka0-ocr-101.txt -- [87] sitka0-ocr-102.txt -- [88] sitka0-ocr-103.txt -- [89] sitka0-ocr-104.txt -- [90] sitka0-ocr-105.txt -- [91] sitka0-ocr-106.txt -- [92] sitka0-ocr-107.txt -- [Blank Page] sitka0-ocr-108.txt -- [93] sitka0-ocr-109.txt -- [94] sitka0-ocr-110.txt -- [Blank Page] sitka0-ocr-111.txt -- [95] sitka0-ocr-112.txt -- [96] sitka0-ocr-113.txt -- [97] sitka0-ocr-114.txt -- [98] sitka0-ocr-115.txt -- [99] sitka0-ocr-116.txt -- [100] sitka0-ocr-117.txt -- sitka0-ocr-118.txt -- [101] sitka0-ocr-119.txt -- [102] sitka0-ocr-120.txt -- [103] sitka0-ocr-121.txt -- [104] sitka0-ocr-122.txt -- [105] sitka0-ocr-123.txt -- [106] sitka0-ocr-124.txt -- [107] sitka0-ocr-125.txt -- [108] sitka0-ocr-126.txt -- *** i will do a detailed look at that list, and explain everything in it, but you might wanna take a gander first, to see what _you_ see. since it might be more fun for you to figure it out for yourself, rather than plow through my pedantic bullshit... *** now, we need to do a little repair on some pages, as follows: the left-bracket was misrecognized on 3 files, so fix that: sitka017.txt -- 112] sitka025.txt -- 120] sitka098.txt -- 184] the first 4 pages are front-matter, so add some "f" pagenumbers: sitka001.txt -- add [f001] sitka002.txt -- add [f002] sitka003.txt -- add [f003] sitka004.txt -- add [f004] the first 2 pagenumbers were deleted by early proofers, so add back: sitka005.txt -- add [1] sitka006.txt -- add [2] page 6 really is a blank page, so let's add a pagenumber to it: sitka010.txt -- add [6] the pagenumber on 1 file wasn't picked up by scanner, so we'll add it: sitka068.txt -- add [58] the pagenumber on the last page, a map, wasn't there, so we'll add it: sitka126.txt -- add [109] the rest are illustration pages (even though some claim to be "blank"), which we can tell because they exist outside of the page-sequencing, so we'll add the "a" filenaming convention to slide them into place... append "a" to these unnumbered pages, which had no pagenumber: sitka015.txt -- add [10a} sitka066.txt -- add [56a} sitka074.txt -- add [62a} sitka079.txt -- add [66a} sitka090.txt -- add [76a} sitka117.txt -- add [100a} sitka030.txt -- change [blank page] to [24a] sitka043.txt -- change [blank page] to [36a] sitka054.txt -- change [blank page] to [46a] sitka063.txt -- change [blank page] to [54a] sitka107.txt -- change [blank page] to [92a] sitka110.txt -- change [blank page] to [94a] as i said in a short response to juliet yesterday, many of these missing and misrecognized pagenumbers _could_ have been "filled in" automatically, because of pagenumber redundancy. but editing them wasn't too difficult for this particular book... (i did the editing using my new editor interface, which i will be revealing to all you excited fans out there next week. oh boy!) *** once all of the pagenumbers in the files have been corrected, output from the above doglobal.pl script would look like this: sitka0-ocr-001.txt -- [f001] sitka0-ocr-002.txt -- [f002] sitka0-ocr-003.txt -- [f003] sitka0-ocr-004.txt -- [f004] sitka0-ocr-005.txt -- [1] sitka0-ocr-006.txt -- [2] sitka0-ocr-007.txt -- [3] sitka0-ocr-008.txt -- [4] sitka0-ocr-009.txt -- [5] sitka0-ocr-010.txt -- [6] sitka0-ocr-011.txt -- [7] sitka0-ocr-012.txt -- [8] ... sitka0-ocr-024.txt -- [19] sitka0-ocr-025.txt -- [20] sitka0-ocr-026.txt -- [21] sitka0-ocr-027.txt -- [22] sitka0-ocr-028.txt -- [23] sitka0-ocr-029.txt -- [24] sitka0-ocr-030.txt -- [24a] sitka0-ocr-031.txt -- [25] ... sitka0-ocr-126.txt -- [109] *** then we can do a variant of that output, to do the renaming for us: rename sitka0-ocr-001.txt as sitkaf001.txt rename sitka0-ocr-002.txt as sitkaf002.txt rename sitka0-ocr-003.txt as sitkaf003.txt rename sitka0-ocr-004.txt as sitkaf004.txt rename sitka0-ocr-005.txt as sitkap001.txt rename sitka0-ocr-006.txt as sitkap002.txt rename sitka0-ocr-007.txt as sitkap003.txt rename sitka0-ocr-008.txt as sitkap004.txt rename sitka0-ocr-009.txt as sitkap005.txt rename sitka0-ocr-010.txt as sitkap006.txt rename sitka0-ocr-011.txt as sitkap007.txt rename sitka0-ocr-012.txt as sitkap008.txt ... rename sitka0-ocr-024.txt as sitkap019.txt rename sitka0-ocr-025.txt as sitkap020.txt rename sitka0-ocr-026.txt as sitkap021.txt rename sitka0-ocr-027.txt as sitkap022.txt rename sitka0-ocr-028.txt as sitkap023.txt rename sitka0-ocr-029.txt as sitkap024.txt rename sitka0-ocr-030.txt as sitkap024a.txt rename sitka0-ocr-031.txt as sitkap025.txt ... rename sitka0-ocr-126.txt as sitkap109.txt *** remember that we have to do the scan files as well. (we'll just do a global change from ".txt" to ".png".) rename sitka0-ocr-001.png as sitkaf001.png rename sitka0-ocr-002.png as sitkaf002.png rename sitka0-ocr-003.png as sitkaf003.png rename sitka0-ocr-004.png as sitkaf004.png rename sitka0-ocr-005.png as sitkap001.png rename sitka0-ocr-006.png as sitkap002.png rename sitka0-ocr-007.png as sitkap003.png rename sitka0-ocr-008.png as sitkap004.png rename sitka0-ocr-009.png as sitkap005.png rename sitka0-ocr-010.png as sitkap006.png rename sitka0-ocr-011.png as sitkap007.png rename sitka0-ocr-012.png as sitkap008.png ... rename sitka0-ocr-024.png as sitkap019.png rename sitka0-ocr-025.png as sitkap020.png rename sitka0-ocr-026.png as sitkap021.png rename sitka0-ocr-027.png as sitkap022.png rename sitka0-ocr-028.png as sitkap023.png rename sitka0-ocr-029.png as sitkap024.png rename sitka0-ocr-030.png as sitkap024a.png rename sitka0-ocr-031.png as sitkap025.png ... rename sitka0-ocr-126.png as sitkap109.png *** this example makes it pretty clear that -- if you only leave the pagenumbers in the o.c.r., just leave 'em! -- it's pretty easy to use them to name your files wisely... pagenumbers in the runhead are easy to grab as well. they're either at the right side of the runhead (if odd) or at the left side of the runhead (on the even pages). (the runhead is usually the first line in the file, right?, but sometimes the pagenumber drops to the second. still it's usually the first _number_ you find in the file, so it's easy enough to code your script to look for that.) again, you have to check them!, to make sure they were recognized correctly, so you can fix 'em if they weren't. but once you've got them all in place, you are golden... and the beauty is that now your files are named wisely! you'll always know page 23 is in the file named "p023", and page 46 is in "p046", and page 123 is in "p123"... moreover, when you want to go to page 46, you will actually _end_up_ on page 46, not some other page that is kinda close, depending on what the "offset" is! *** and here's another nice thing. you'll notice that we had some unnumbered pages that were named with an appended "a"? well, we need to keep the recto and the verso straight, if we want to make good e-books, so we can't just add an "a" without a backside "b" too. but hey, that's no problem at all! after each "a" page, we just slide in a "b" name underneath it, and presto!, our recto/verso is right again. and we didn't have to _readjust_ all filenames that followed each "insertion", because those files were wisely-named to begin with. *** there's one more thing to talk about: coding apps... (if you don't do coding, you can leave now if you want; but it probably won't hurt you to read the rest of this. you made it _this_ far, so you must be a glutton for it.) first let's get the necessary admission out of the way... it's very easy to do your coding when you name your files in a stupid 001.txt-999.txt way, because you can simply code the number as a shortcut for the filename. you use an integer for your pagenumber, and it's easy. your _files_ go from 1 to 999, and so do your _names_. it's easy to keep track of things; you just go up or down. because of this ease, i can understand why you _might_ want to keep using those stupid filenames. but don't... still, at first, it may not be immediately obvious to you how to depart from this method. but it really is simple. instead of thinking of each filename as a _number_ (i.e., an integer), think of it as a "name" (i.e., a string). yes, the filename has a number _in_ it, and the number is the _important_part_ (to your end-user), but do not _think_ of it in this way, at least not for the time being. think of the filename as a string, nothing but a string... however, you will _load_ those strings into an _array_... you'll have as many items in the array as you have files, and the value of each item will be the _name_ of the file. then you think of the _index_ for that array as an integer -- because that's what it is! -- and you use _that_ in the exact same way you used your pagenumber integer before. so see, you didn't have to give up the easy convenience of a number to keep on-track like you thought you'd have to. your index array goes up and down, just like it did before. in other words, you can still think of your _files_ as going from 1 to 999, and increment your array index as before. but whenever you want to know the _filename_ of a page, you look-up the value of the array at that index-number. so let's look at how this would work for our "sitka" book. the string value of item array #1 would be "sitkaf001". the string value of item array #2 would be "sitkaf002". the string value of item array #5 would be "sitkap001", because it's page 1, and that's where the foreword starts. the string value of item array #11 would be "sitkap007", because that's page 7, and that's where chapter 1 starts... and the string value of item array #126 would be "sitkap109", and that's the map that's on the last (recto) page in the book. (of course, there will be a blank verso that'll be "sitkap110", since a book cannot have an odd number of pages, can it?) so the last question is "how do i populate the filename array?" there are various ways you can do it, but two good ones are: 1. read the book's subdirectory to glean the graphic filenames. 2. create a "map" file intended to provide the graphic filenames. you can also combine these 2 methods as "belt and suspenders"; you create a map file, but your viewer-app confirms the map by reading the subdirectory to ensure all the graphic files are there. it's not nearly as difficult to create a "map" file as you might think. for instance, look closely at the sitka file we're working on: > http://z-m-l.com/go/jimad/sitka0-ocr.txt just pull out the separator lines, and you've got your map file. of course, the current version of that file is using the current stupid filenames, but you can generate a new concatenated file after you've renamed your .txt files, and your map will be fine. you can also just view your subdirectory structure in a browser, and copy out the filenames, and save them in a file, and bingo!, there's your map file. myself, with z.m.l., i use the separator-line method, as you can see if you look at any paginated z.m.l. file. the lines which have double-braces enclosing a graphic filename constitute the map. *** all in all, if you start naming your files intelligently, you'll find that the benefits far outweigh the costs of doing any rename... still, i've tried here, in this post, to show you how to do a rename in the easiest possible way. just remember not to do like d.p. -- _and_keep_the_darn_filename_information_in_your_o.c.r._files_... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Mar 19 06:03:54 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 19 Mar 2010 09:03:54 EDT Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output Message-ID: <6dee.30b7d3b4.38d4d03a@aol.com> keith, you have no pudding. you have a lot of cards, which purport to have recipes on them, but i cannot make heads or tails of them, and they certainly have no taste, nor can they be eaten, so -- not to be mean or anything, but -- what good are they? or maybe it's just me. if someone else can explain to us just exactly what it is that keith is talking about, please do. thank you. have a nice day. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Fri Mar 19 10:59:21 2010 From: jimad at msn.com (James Adcock) Date: Fri, 19 Mar 2010 10:59:21 -0700 Subject: [gutvol-d] [SPAM] RE: Re: jim, i have some questions about pgdiff output In-Reply-To: <8c234.435b34c3.38d40831@aol.com> References: <8c234.435b34c3.38d40831@aol.com> Message-ID: I've put up a new copy of the tool pgdiff that contains an option "-smarted" which outputs the text in a form similar to what I think you want BB for your "smart editor" tool. It is similar to what pgdiff originally output but that output I had found too tedious and verbose for my taste when I am editing the output using a regex editor. Your suggestions work in simple cases but I think you will find that they fail relatively spectacularly on difficult cases, such as when performing versioning across different editions. I also updated the example output file "BBoutput.txt" to show the new output. "Non-diff" material will show up in the output if the "Non-diff" material is in a mixed order. For example if the two files have: The quick dog jumps... And the other file has: The dog quick jumps. Then dog and/or quick will show up in the edits because there is no way you can do a Levenshtein edit that doesn't include both "dog" and "quick" because the Levenshtein measure doesn't include a notion of "reverse the order of these two tokens." Also you may THINK two tokens are identical but they aren't identical unless they ARE identical - the measure also doesn't have a notion of "these two tokens look really similar so I want them to match up." Either tokens match or they don't. So in the case of: The quick dogs jumps.. Vs. The dog quick jumps.. The algorithm isn't going to try to match up "dog" and "dogs" because it has no notion of token "similarity" - "dog" and "dogs" are simply two different tokens and they don't match. Further, even if they do match they still may not compare to each other if there are nearby edits that also don't match, such that the total number of "insert" "delete" and "substitute" edits is minimized by NOT making the two identical tokens match up. If you look carefully at the output of diff you will see it has the same problem (where a "token" is a line of text not a word) - diff DOES NOT always "successfully" match up two lines of identical text - because like pgdiff diff isn't trying to maximize the number of token matches, rather it is trying to minimize the number of Levenshtein edits. Again, the problem is basically the domain you are interested in working on and the domain I am interested in working on is very different. You want a tool that catches small changes within a line of text, and I want a tool that catches large changes within a file. It is easy to hypothesize what the "answer" is if you are not the one doing the work. But if you are the one doing the work you rapidly find "oops that idea doesn't work after all!" The real goal of the tool is to find places in the text where a human bean needs to step in to fix the problem, and that it does extremely well when the human bean is driving a regex editor and looking at a copy of the original bitmap page. If one wants to try to do a "smart editor" sometimes its going to work and other times its going to fail spectacularly - other than identifying there IS a problem - and then again the human bean is going to have to sort out and fix the problem. In the worse case this involves deleting the text being questioned and typing in the text seen on the bitmap page - which again is not typically a terrible situation - if you have a tool that will point you to the problem in the first place which certainly pgdiff does. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jimad at msn.com Fri Mar 19 11:29:21 2010 From: jimad at msn.com (James Adcock) Date: Fri, 19 Mar 2010 11:29:21 -0700 Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output In-Reply-To: <72F0393F-49CB-428D-9E64-6E752997D720@uni-trier.de> References: <72F0393F-49CB-428D-9E64-6E752997D720@uni-trier.de> Message-ID: > Proofing is per se linear, has relatively few differences, and is aided by humans, and a new version is to be created and not a merge. The process is simple compare text A and B as long as they are equal and then gather the information as long as the differ, present the difference, offer possible changes, continue. Without much analysis one can see that this process is linear. Agreed -- although again you run into problems when your assumptions break down. Pgdiff wasn't intended for these simply "change a couple letters within a line of text" problems. It was intended for problems of the nature of "I have two different editions of the text from two different continents one using English spellings and one using American spellings and having different linebreaks and different pagebreak and different intros and censorship and different indexes and I want to use one to help find scannos in the other." Yes it can be used for simpler tasks but if you have a simpler task you might be better off to figure out exactly what that task is and write a tool to match that task. Human edits within line tend to be char-by-char and you might be better off using a Levenshtein measure with the "token" set to be a char and the "string" set to be a line of text -- to give an obvious example -- since its not obvious to me how someone uses a mouse and a keyboard to make changes other than "insert a char" "delete a char" or "substitute a char" -- unless one uses cut and paste, in which case all assumptions are off again.... From jimad at msn.com Fri Mar 19 11:35:21 2010 From: jimad at msn.com (James Adcock) Date: Fri, 19 Mar 2010 11:35:21 -0700 Subject: [gutvol-d] Re: save those pagenumber references In-Reply-To: <6840.ffe3dd8.38d4cebc@aol.com> References: <6840.ffe3dd8.38d4cebc@aol.com> Message-ID: >ok, on the "good news" front, it appears that rfrank has finally decided to start naming his files more wisely, so big respect to the people who steered in that direction. How do you propose to deal with texts that have a large number of "prefix" pages numbered something like "iii" for example? How do you propose to deal with texts that have a large number of "prefix" pages which are not numbered at all? How do you propose to deal with texts where the numbering scheme was screwed up in the original text? How do you propose to deal with texts which do not count illustration pages in their numbering scheme? Etc. Again, it's great to have a simple system that works except when it doesn't work in which case it's not so simple anymore. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Mar 19 13:11:48 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 19 Mar 2010 16:11:48 EDT Subject: [gutvol-d] Re: "that doesn't work" Message-ID: <23e82.33ce9749.38d53484@aol.com> jim said: > I?ve put up a new copy of the tool pgdiff > that contains an option ?-smarted? which > outputs the text in a form similar to what > I think you want BB for your ?smart editor? tool.? i'm sure some of your users will enjoy that option, jim... as you might expect, i'll likely stick with my own tools... but yes, this new option might allow me to perfect the tool that i've built in support of your pgdiff tool, i hope. > Your suggestions work in simple cases but?I think > you will find that they fail relatively spectacularly > on difficult cases, such as when performing versioning > across different editions. well, if i'm gonna fail, please let me fail "spectacularly". i compare different editions using a different technique; essentially i do a _paragraph-level_ comparison for that. it's easy enough to unwrap texts to the paragraph level. indeed, i do paragraph-level analyses in my comparisons all the time. that's how i catch the paragraphing glitches. (it's also necessary to work at the paragraph level when you're fixing spacey-quotes, as i have mentioned before.) > I also updated the example output file ?BBoutput.txt? > to show the new output. great. i'll go get it this afternoon... > Again, the problem is basically the domain > you are interested in working on and the domain > I am interested in working on is very different. actually, they're not. but that's another question for another day. here today's issue is finding and fixing errors by comparing two versions which are similar... > You want a tool that catches small changes > within a line of text, and I want a tool that catches > large changes within a file. two rejoinders. first, my tools are capable of finding "large differences" if they are what exist. but, like i just said, that arena is not of much particular interest here on the p.g. listserve. second, i have -- without knowing it at first -- worked on doing comparisons between what turned out to be different editions of a book. and most of the changes were not "large" ones, but rather "small" ones, notably punctuation variations reflecting different house "styles". i discussed this particular comparison at _great_ length over on the d.p. forums, under a thread with a title like "a revolutionary method of proofing", if you're interested. > It is easy to hypothesize what the ?answer? is > if you are not the one doing the work. i agree. that's why i suggested we work on actual data. i find it best if i don't bias my research by selecting the data that i work on, so i work on other people's stuff, which is why i choose that book from rfrank. however, if you want to share some data on a book of your own, one you're working on, i would be happy to look at it... > But if you are the one doing the work you rapidly find > ?oops that idea doesn?t work after all!?? you know, i hear a lot of people saying "that doesn't work". but usually, they're being bamboozled by some _small_ issue that can be overcome quite easily if they just try... a good example of that was yesterday, when juliet said "your renaming solution won't work because pagenumbers are often misrecognized." well, yeah, that happens, but that particular "obstacle" can be hurdled with little effort. so i invite you to bring any "doesn't work" problems to me... i like the challenge of seeing if i can make it work regardless. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Fri Mar 19 13:33:30 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 19 Mar 2010 16:33:30 EDT Subject: [gutvol-d] Re: save those pagenumber references Message-ID: <254bd.5074f290.38d5399a@aol.com> jim said: > How do you propose to deal with texts that have a large number > of ?prefix? pages numbered something like ?iii? for example? > > How do you propose to deal with texts that have a large number > of ?prefix? pages which are not numbered at all? > > How do you propose to deal with texts where the numbering > scheme was screwed up in the original text? > > How do you propose to deal with texts which do not count > illustration pages in their numbering scheme? > > Etc. > > Again, it?s great to have a simple system that works except > when it doesn?t work in which case it?s not so simple anymore. gee, jim. i just talked about how people get bamboozled by small issues, which can be hurdled quite easily if you just set your mind to it... and here you make a reply with a whole handful of small issues. not even "small", really... more like _tiny_... even _teeny-tiny_... indeed, if you really look at the example i discussed, you'll see that several of your questions were answered there _already_... so i'm not even going to go through the exercise of answering. if you really want answers, you can generate them yourself, or go back and look where i have been discussing this issue for _many_years_, and review any one of those exhausting threads. there _is_ such a thing as a stupid question. i've asked them myself, as have all of us. and jim, you just asked a _handful._ but you know, jim, the thing i'm wondering is this... i've held this position on intelligent filenaming conventions for _years_ now. and that's just counting on _this_listserve._ i've been practicing what i preach for about two decades now. if there was really some problem with my system, don't you think i would have discovered it by now? do you really think that you can come up with a reaction in your first 5 minutes that i haven't experienced in the years and years and years i've been doing intensely close analysis of book digitization? i mean, _seriously_... did you really think i just happened to "overlook" that books generally have forward-matter pages, and that those pages have a different pagenumber sequence? and do you really think i just hadn't ever noticed that some of the illustration-pages in books are unnumbered pages? really? so let me say this _again_, jim... if you want to have dialog with me, you _cannot_ say stupid things. you simply cannot. because i won't continue to talk with you if you do. capiche? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajhaines at shaw.ca Fri Mar 19 17:09:19 2010 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Fri, 19 Mar 2010 17:09:19 -0700 Subject: [gutvol-d] Re: save those pagenumber references References: <6840.ffe3dd8.38d4cebc@aol.com> Message-ID: Jim, the material below describes the scanset naming standard used by PG. For an example, go to http://www.gutenberg.org/etext/25896, click on the Base Directory link at the bottom of the book's catalog info, above the actual files. You should see a 25896-page-images folder. Click on it to see the actual files. Al Basic format: The prefix for the cover pages is: "c". The prefix for the roman pages is: "f". The prefix for the arabic pages is: "p". *** For blank pages there should be no file and the page number should be skipped. Optionally an image saying: "This page is blank in the original." may be inserted. *** Example of file naming: front cover c0001.png back cover c0002.png spine c0003.png i title page f0001.png ii title verso f0002.png iii dedication f0003.png iv is blank v contents f0005.png page 1 p0001.png page 2 p0002.png image on page 2 p0002-image1.png image on page 2 p0002-image2.png page 3 p0003.png page 4 is blank page 5 p0005.png ... ... page 9999 p9999.png ----- Original Message ----- From: James Adcock To: 'Project Gutenberg Volunteer Discussion' Sent: Friday, March 19, 2010 11:35 AM Subject: [gutvol-d] Re: save those pagenumber references >ok, on the "good news" front, it appears that rfrank has finally decided to start naming his files more wisely, so big respect to the people who steered in that direction. How do you propose to deal with texts that have a large number of "prefix" pages numbered something like "iii" for example? How do you propose to deal with texts that have a large number of "prefix" pages which are not numbered at all? How do you propose to deal with texts where the numbering scheme was screwed up in the original text? How do you propose to deal with texts which do not count illustration pages in their numbering scheme? Etc. Again, it's great to have a simple system that works except when it doesn't work in which case it's not so simple anymore. _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d From dakretz at gmail.com Fri Mar 19 17:27:56 2010 From: dakretz at gmail.com (don kretz) Date: Fri, 19 Mar 2010 17:27:56 -0700 Subject: [gutvol-d] Re: save those pagenumber references In-Reply-To: References: <6840.ffe3dd8.38d4cebc@aol.com> Message-ID: <627d59b81003191727i19c63e85t8f3897fb05089f2@mail.gmail.com> So far this "spec" seems to be primarily a legend. Is it documented anywhere? On Fri, Mar 19, 2010 at 5:09 PM, Al Haines (shaw) wrote: > Jim, the material below describes the scanset naming standard used by PG. > For an example, go to http://www.gutenberg.org/etext/25896, click on the > Base Directory link at the bottom of the book's catalog info, above the > actual files. You should see a 25896-page-images folder. Click on it to > see the actual files. > > Al > > > > Basic format: > > The prefix for the cover pages is: "c". > The prefix for the roman pages is: "f". > The prefix for the arabic pages is: "p". > > *** > > For blank pages there should be no file and the page number should be > skipped. Optionally an image saying: "This page is blank in the > original." may be inserted. > > *** > > Example of file naming: > > front cover c0001.png > back cover c0002.png > spine c0003.png > > i title page f0001.png > ii title verso f0002.png > iii dedication f0003.png > iv is blank > v contents f0005.png > > page 1 p0001.png > page 2 p0002.png > image on page 2 p0002-image1.png > image on page 2 p0002-image2.png > page 3 p0003.png > page 4 is blank > page 5 p0005.png > ... ... > page 9999 p9999.png > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajhaines at shaw.ca Fri Mar 19 17:46:12 2010 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Fri, 19 Mar 2010 17:46:12 -0700 Subject: [gutvol-d] Re: save those pagenumber references References: <6840.ffe3dd8.38d4cebc@aol.com> <627d59b81003191727i19c63e85t8f3897fb05089f2@mail.gmail.com> Message-ID: <14A40D8B90F144F7937A4DDF3839F373@alp2400> No. It was developed and used by Joshua Hutchinson, when he used to post DP scansets to PG. I got the material from his emails to the WWers a few months ago. So far as I know, only Joshua has done this with scans. Some submitters (usually from DP) incorporate scansets into their HTML files, with the file's page numbers linked to the scans. Offhand, I don't have any examples of this. ----- Original Message ----- From: don kretz To: Project Gutenberg Volunteer Discussion Sent: Friday, March 19, 2010 5:27 PM Subject: [gutvol-d] Re: save those pagenumber references So far this "spec" seems to be primarily a legend. Is it documented anywhere? On Fri, Mar 19, 2010 at 5:09 PM, Al Haines (shaw) wrote: Jim, the material below describes the scanset naming standard used by PG. For an example, go to http://www.gutenberg.org/etext/25896, click on the Base Directory link at the bottom of the book's catalog info, above the actual files. You should see a 25896-page-images folder. Click on it to see the actual files. Al Basic format: The prefix for the cover pages is: "c". The prefix for the roman pages is: "f". The prefix for the arabic pages is: "p". *** For blank pages there should be no file and the page number should be skipped. Optionally an image saying: "This page is blank in the original." may be inserted. *** Example of file naming: front cover c0001.png back cover c0002.png spine c0003.png i title page f0001.png ii title verso f0002.png iii dedication f0003.png iv is blank v contents f0005.png page 1 p0001.png page 2 p0002.png image on page 2 p0002-image1.png image on page 2 p0002-image2.png page 3 p0003.png page 4 is blank page 5 p0005.png ... ... page 9999 p9999.png ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Fri Mar 19 18:58:51 2010 From: dakretz at gmail.com (don kretz) Date: Fri, 19 Mar 2010 18:58:51 -0700 Subject: [gutvol-d] Re: save those pagenumber references In-Reply-To: <14A40D8B90F144F7937A4DDF3839F373@alp2400> References: <6840.ffe3dd8.38d4cebc@aol.com> <627d59b81003191727i19c63e85t8f3897fb05089f2@mail.gmail.com> <14A40D8B90F144F7937A4DDF3839F373@alp2400> Message-ID: <627d59b81003191858q7c3bd468v2869a8185d875653@mail.gmail.com> OK - I got those from Joshua and they formed the requirements for Version 1 of Twister. Here's an extensive forum thread on DP where we hashed this all out. On Fri, Mar 19, 2010 at 5:46 PM, Al Haines (shaw) wrote: > No. It was developed and used by Joshua Hutchinson, when he used to post > DP scansets to PG. I got the material from his emails to the WWers a few > months ago. > > So far as I know, only Joshua has done this with scans. Some submitters > (usually from DP) incorporate scansets into their HTML files, with the > file's page numbers linked to the scans. Offhand, I don't have any examples > of this. > > > > ----- Original Message ----- > *From:* don kretz > *To:* Project Gutenberg Volunteer Discussion > *Sent:* Friday, March 19, 2010 5:27 PM > *Subject:* [gutvol-d] Re: save those pagenumber references > > So far this "spec" seems to be primarily a legend. > > Is it documented anywhere? > > On Fri, Mar 19, 2010 at 5:09 PM, Al Haines (shaw) wrote: > >> Jim, the material below describes the scanset naming standard used by PG. >> For an example, go to http://www.gutenberg.org/etext/25896, click on the >> Base Directory link at the bottom of the book's catalog info, above the >> actual files. You should see a 25896-page-images folder. Click on it to >> see the actual files. >> >> Al >> >> >> >> Basic format: >> >> The prefix for the cover pages is: "c". >> The prefix for the roman pages is: "f". >> The prefix for the arabic pages is: "p". >> >> *** >> >> For blank pages there should be no file and the page number should be >> skipped. Optionally an image saying: "This page is blank in the >> original." may be inserted. >> >> *** >> >> Example of file naming: >> >> front cover c0001.png >> back cover c0002.png >> spine c0003.png >> >> i title page f0001.png >> ii title verso f0002.png >> iii dedication f0003.png >> iv is blank >> v contents f0005.png >> >> page 1 p0001.png >> page 2 p0002.png >> image on page 2 p0002-image1.png >> image on page 2 p0002-image2.png >> page 3 p0003.png >> page 4 is blank >> page 5 p0005.png >> ... ... >> page 9999 p9999.png >> >> >> ------------------------------ > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sat Mar 20 09:43:42 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 20 Mar 2010 12:43:42 EDT Subject: [gutvol-d] Re: save those pagenumbers! Message-ID: <4e8ac.4d6c005e.38d6553e@aol.com> i have a reply in the works, but i won't post it until monday... so keep your minds open until then, and have a nice weekend! -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Sat Mar 20 18:01:13 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 20 Mar 2010 21:01:13 EDT Subject: [gutvol-d] a sitka smoothreading glitch Message-ID: <62f1f.ad509d4.38d6c9d9@aol.com> rfrank's roundless site -- fadedpage.com -- now has their "sitka" book available for _smooth-reading_, at: > http://www.fadedpage.com/s/sitka/sitka.htm i'll have a lot of nice things to say about rfrank's work, because his e-books really do look quite clean and nice, but -- since this book is in-process and all -- for now, i'll just report a little glitch, on (waitforit) pagenumbers. seems a stray page-indicator found its way onto page 96, so that page 96 is now incorrectly short, with some of its text now being shown on what's called page 97, and with every page after page 96 having its pagenumber off by 1, so the last words ("may be made") are incorrectly reported as page 109, when in actuality they occurred on page 108. that kind of thing can happen when the pagenumbers are something that you recompute at the end of the process, rather than something integral to your entire workflow... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From joshua at hutchinson.net Sun Mar 21 07:22:53 2010 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Sun, 21 Mar 2010 14:22:53 +0000 (GMT) Subject: [gutvol-d] Re: save those pagenumber references References: <6840.ffe3dd8.38d4cebc@aol.com> <627d59b81003191727i19c63e85t8f3897fb05089f2@mail.gmail.com> <14A40D8B90F144F7937A4DDF3839F373@alp2400> <627d59b81003191858q7c3bd468v2869a8185d875653@mail.gmail.com> Message-ID: <102322535.23878.1269181373985.JavaMail.mail@webmail04> An HTML attachment was scrubbed... URL: From gbuchana at teksavvy.com Sun Mar 21 09:45:19 2010 From: gbuchana at teksavvy.com (Gardner Buchanan) Date: Sun, 21 Mar 2010 12:45:19 -0400 Subject: [gutvol-d] Re: save those pagenumber references In-Reply-To: <102322535.23878.1269181373985.JavaMail.mail@webmail04> References: <6840.ffe3dd8.38d4cebc@aol.com> <627d59b81003191727i19c63e85t8f3897fb05089f2@mail.gmail.com> <14A40D8B90F144F7937A4DDF3839F373@alp2400> <627d59b81003191858q7c3bd468v2869a8185d875653@mail.gmail.com> <102322535.23878.1269181373985.JavaMail.mail@webmail04> Message-ID: <4BA64D1F.5030008@teksavvy.com> Is there any suggestion what a formatted text-only book that retains page numbers should look like? Is it reasonable to just sprinkle them into the text, maybe something like this: --------- Captain Headley, musingly pressing his hand to his brow, "and how unfortunate. Had Winnebeg brought General Hull's despatch one day sooner, all this would not have happened, for they never could have obtained [35] permission to leave the fort, much less to visit so dangerous a vicinity as Hardscrabble. Our march from this would have changed the whole current of events." "Even so," returned Mrs. Headley; "but here is a packet, left with Serjeant Nixon, which he has just handed to me, and which may throw some light on the subject. I will first glance over it myself." ----------- "God bless you, Ronayne! Alas, you are not alone in, your trials--much of moment awaits us all. Good night!" And, assuming her disguise, she speedily regained her home. [44] CHAPTER X. "Ne'er may he live to see a sunshine day that cries--Retire, when Warwick bids him stay." --_Henry IV._ On the western bank of the south side of the Chicago River, and ----------- ============================================================ Gardner Buchanan Ottawa, ON FreeBSD: Where you want to go. Today. From ajhaines at shaw.ca Sun Mar 21 10:34:05 2010 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sun, 21 Mar 2010 10:34:05 -0700 Subject: [gutvol-d] Re: save those pagenumber references References: <6840.ffe3dd8.38d4cebc@aol.com> <627d59b81003191727i19c63e85t8f3897fb05089f2@mail.gmail.com> <14A40D8B90F144F7937A4DDF3839F373@alp2400> <627d59b81003191858q7c3bd468v2869a8185d875653@mail.gmail.com> <102322535.23878.1269181373985.JavaMail.mail@webmail04> <4BA64D1F.5030008@teksavvy.com> Message-ID: <5248C1F0A99C4852BE91C37F933C482D@alp2400> There are two articles in the PG Volunteers' FAQ about page numbers: http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.98._Should_I_keep_page_numbers_in_the_e-text.3F and http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.99._In_the_exceptional_cases_where_I_keep_page_numbers.2C_how_should_I_format_them.3F My personal practice is to use curly braces for page numbers and square brackets for footnote numbers. I include page numbers in an etext only if the book has internal references of some kind, e.g. footnotes that refer to specific pages, an index, or a table of contents that's sufficiently detailed as to function as an index. I number only the first page of an index, since I've never seen one with references to elsewhere in itself. Two-column indexes are rendered as single-column. For examples, see http://www.gutenberg.org/etext/19765 or http://www.gutenberg.org/etext/30610. Al ----- Original Message ----- From: "Gardner Buchanan" To: "Project Gutenberg Volunteer Discussion" Sent: Sunday, March 21, 2010 9:45 AM Subject: [gutvol-d] Re: save those pagenumber references > Is there any suggestion what a formatted text-only book that > retains page numbers should look like? Is it reasonable to just > sprinkle them into the text, maybe something like this: > > --------- > Captain Headley, musingly pressing his hand to his brow, "and how > unfortunate. Had Winnebeg brought General Hull's despatch one day > sooner, all this would not have happened, for they never could have > obtained [35] permission to leave the fort, much less to visit so > dangerous a vicinity as Hardscrabble. Our march from this would > have changed the whole current of events." > > "Even so," returned Mrs. Headley; "but here is a packet, left with > Serjeant Nixon, which he has just handed to me, and which may throw > some light on the subject. I will first glance over it myself." > ----------- > "God bless you, Ronayne! Alas, you are not alone in, your trials--much > of moment awaits us all. Good night!" > > And, assuming her disguise, she speedily regained her home. > > > [44] > > > CHAPTER X. > > "Ne'er may he live to see a sunshine day that cries--Retire, > when Warwick bids him stay." > --_Henry IV._ > > On the western bank of the south side of the Chicago River, and > ----------- > > > ============================================================ > Gardner Buchanan > Ottawa, ON FreeBSD: Where you want to go. Today. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d > From Bowerbird at aol.com Sun Mar 21 11:18:52 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 21 Mar 2010 14:18:52 EDT Subject: [gutvol-d] Re: save those pagenumber references Message-ID: <7d0a7.5b337a92.38d7bd0c@aol.com> al said: > My personal practice is and therein lies the rub. the p.g. e-texts are rife with "personal practice". and the d.p. e-texts are soaking in it right now... one thing you have to know about pagenumbers is some people need 'em and other people hate 'em... which means you have to have 'em, and you have to give people a way to shut them off... _totally_ off... the only way to do that is to establish a convention, so viewer-app developers can make everyone happy. most "personal practice" implementations try to walk the tightrope between the two sides, and fail _both_, in the sense they don't do the _full_ job that the "pro" people want pagenumbers to do, but yet aren't nearly as non-invasive as the "anti" people reasonably want. if a hundred different digitizers do it a hundred ways -- or a thousand digitizers do it a thousand ways -- nobody is gonna end up happy; we'll all be miserable. and face it, if both sides are going to end up unhappy, you might as well flip a coin and make one side happy. the only way to make it work is to do it _one_way_... so developers can target the convention successfully. michael isn't going to prescribe this remedy for p.g. even if he tried, he probably would not succeed, and he has made it clear that he doesn't even want to try and do things like that, as per his basic philosophy... nobody else has a remote chance of success with p.g. so alas, it is not to be. but perhaps it doesn't matter. because it's becoming increasingly clear that the only cyberlibrary that's going to matter is the google one, and -- after a few missteps at the very beginning -- google has gotten pretty smart about pagenumbers... so whatever conventions they establish will stick. *** but, to answer the question... gardner said: > Is there any suggestion what > a formatted text-only book that > retains page numbers should look like? > Is it reasonable to just sprinkle them > into the text, maybe something like this: there's nothing difficult about the issue, technically. you wouldn't want to "sprinkle them" thoughtlessly, but any number of _well-specified_conventions_ can handle the tiny number of wrinkles that do crop up... (tell me if you want me to dredge my memory-bank to catalog them, but there seriously aren't too many.) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajhaines at shaw.ca Sun Mar 21 11:33:44 2010 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sun, 21 Mar 2010 11:33:44 -0700 Subject: [gutvol-d] Re: save those pagenumber references References: <7d0a7.5b337a92.38d7bd0c@aol.com> Message-ID: What bowerbird failed (or didn't bother) to mention was that using curly braces for page numbers and square brackets for footnotes are practices that are documented in PG's Volunteers' FAQ (V.98, V.99, V.103). As such, my "personal practice" is not an invention of my own, but are PG-standard, documented, practices that I've adopted for my projects. Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; bowerbird at aol.com Sent: Sunday, March 21, 2010 11:18 AM Subject: [gutvol-d] Re: save those pagenumber references al said: > My personal practice is and therein lies the rub. the p.g. e-texts are rife with "personal practice". and the d.p. e-texts are soaking in it right now... one thing you have to know about pagenumbers is some people need 'em and other people hate 'em... which means you have to have 'em, and you have to give people a way to shut them off... _totally_ off... the only way to do that is to establish a convention, so viewer-app developers can make everyone happy. most "personal practice" implementations try to walk the tightrope between the two sides, and fail _both_, in the sense they don't do the _full_ job that the "pro" people want pagenumbers to do, but yet aren't nearly as non-invasive as the "anti" people reasonably want. if a hundred different digitizers do it a hundred ways -- or a thousand digitizers do it a thousand ways -- nobody is gonna end up happy; we'll all be miserable. and face it, if both sides are going to end up unhappy, you might as well flip a coin and make one side happy. the only way to make it work is to do it _one_way_... so developers can target the convention successfully. michael isn't going to prescribe this remedy for p.g. even if he tried, he probably would not succeed, and he has made it clear that he doesn't even want to try and do things like that, as per his basic philosophy... nobody else has a remote chance of success with p.g. so alas, it is not to be. but perhaps it doesn't matter. because it's becoming increasingly clear that the only cyberlibrary that's going to matter is the google one, and -- after a few missteps at the very beginning -- google has gotten pretty smart about pagenumbers... so whatever conventions they establish will stick. *** but, to answer the question... gardner said: > Is there any suggestion what > a formatted text-only book that > retains page numbers should look like? > Is it reasonable to just sprinkle them > into the text, maybe something like this: there's nothing difficult about the issue, technically. you wouldn't want to "sprinkle them" thoughtlessly, but any number of _well-specified_conventions_ can handle the tiny number of wrinkles that do crop up... (tell me if you want me to dredge my memory-bank to catalog them, but there seriously aren't too many.) -bowerbird ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: From dakretz at gmail.com Sun Mar 21 12:16:31 2010 From: dakretz at gmail.com (don kretz) Date: Sun, 21 Mar 2010 12:16:31 -0700 Subject: [gutvol-d] Re: save those pagenumber references In-Reply-To: References: <7d0a7.5b337a92.38d7bd0c@aol.com> Message-ID: <627d59b81003211216l1e020e97p99fe300aa2bf4d38@mail.gmail.com> PG needs for age numbers need to be there somewhere because without them there's no future hope for controlled/moderated text refinement. We need them to match up the canonical text with the canonical image and quickly verify that a proposed correction is legitimate. Whether the page number needs to be included in the downloaded "plain text" version, or whether the "plain text" version should be the canonical version are separate matters. -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Mon Mar 22 02:02:25 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Mon, 22 Mar 2010 10:02:25 +0100 Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output In-Reply-To: <6dee.30b7d3b4.38d4d03a@aol.com> References: <6dee.30b7d3b4.38d4d03a@aol.com> Message-ID: <3775A2FB-3BD0-499C-BF63-CB0DC894DEE2@uni-trier.de> BB, What should I say to you. To take one of my favorite quotes from Lotfih Zadeh "We are still confused, but on a higher level" ! Though I tell you this much I am at least thinking about a proof of concept. I also, am looking for the pieces I need. The biggest one will be the OCR engine that has the features I will need. What I can not promise is that I will keep up interest in it. regards Keith Am 19.03.2010 um 14:03 schrieb Bowerbird at aol.com: > keith, you have no pudding. > > you have a lot of cards, which purport to have recipes on them, > but i cannot make heads or tails of them, and they certainly have > no taste, nor can they be eaten, so -- not to be mean or anything, > but -- what good are they? > > or maybe it's just me. if someone else can explain to us > just exactly what it is that keith is talking about, please do. > > thank you. have a nice day. > > -bowerbird > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Mon Mar 22 02:19:54 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Mon, 22 Mar 2010 10:19:54 +0100 Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output In-Reply-To: References: <72F0393F-49CB-428D-9E64-6E752997D720@uni-trier.de> Message-ID: <3DEDF2F6-D95C-4820-83D4-2B4A4C1F3EAE@uni-trier.de> Hi James, I do understand the the levenstein measure and actually do not think we need to discuss it caveats far as precision and sucessfulness. An interesting approach by using English and American versions. Yet, that makes pgdiff specific to one set of languages. On the other side. If you out the problem of the forwards, tocs, and indices et. al. you could simply try adding in a component that rewrites with the others spelling conventions. That I know is no trivial task. As far as my considering not using diff and just a simple comparison method which is linear, the problem of alignment does remain. I admit I have done the math or have an exact algorithm but it does seem to me that it would be polynominal and still far better than n^2. regards Keith. Am 19.03.2010 um 19:29 schrieb James Adcock: >> Proofing is per se linear, has relatively few differences, and is > aided by > humans, and a new version is to be created and not a merge. > The process is simple compare text A and B as long as they are equal > and then gather the information as long as the differ, present the > difference, > offer possible changes, continue. > Without much analysis one can see that this process is linear. > > Agreed -- although again you run into problems when your assumptions break > down. Pgdiff wasn't intended for these simply "change a couple letters > within a line of text" problems. It was intended for problems of the nature > of "I have two different editions of the text from two different continents > one using English spellings and one using American spellings and having > different linebreaks and different pagebreak and different intros and > censorship and different indexes and I want to use one to help find scannos > in the other." Yes it can be used for simpler tasks but if you have a > simpler task you might be better off to figure out exactly what that task is > and write a tool to match that task. Human edits within line tend to be > char-by-char and you might be better off using a Levenshtein measure with > the "token" set to be a char and the "string" set to be a line of text -- to > give an obvious example -- since its not obvious to me how someone uses a > mouse and a keyboard to make changes other than "insert a char" "delete a > char" or "substitute a char" -- unless one uses cut and paste, in which case > all assumptions are off again.... > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d From jimad at msn.com Mon Mar 22 07:32:44 2010 From: jimad at msn.com (Jim Adcock) Date: Mon, 22 Mar 2010 07:32:44 -0700 Subject: [gutvol-d] Re: save those pagenumber references In-Reply-To: <254bd.5074f290.38d5399a@aol.com> References: <254bd.5074f290.38d5399a@aol.com> Message-ID: >there _is_ such a thing as a stupid question. i've asked them myself, as have all of us. and jim, you just asked a _handful._ Perhaps there is such a thing as a "stupid answer" since the answer you gave recently addresses none of the issues I raised. From Bowerbird at aol.com Mon Mar 22 10:51:55 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 22 Mar 2010 13:51:55 EDT Subject: [gutvol-d] Re: save those pagenumber references Message-ID: jim said: > Perhaps there is such a thing as a "stupid answer" since > the answer you gave recently addresses none of the issues I raised. i was quite clear that i was deliberately not answering your questions, precisely because they were stupid, and -- even further -- because their answers were contained in the example i'd given just previously. i'll work very hard, and go far out of my way, to have a good dialog. because i value that. which is the same reason i won't countenance someone polluting that dialog with posts that regress the progress. i've spent years refining my filenaming conventions. you spent 5 minutes and came up with some kindergarten-level questions. use another 5 minutes, and you can answer your own questions. maybe then you'll also know why i would rather spend 10 minutes of my own time writing _this_ post instead of 5 minutes writing a post that answered your stupid questions. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Mar 22 13:02:52 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 22 Mar 2010 16:02:52 EDT Subject: [gutvol-d] Re: save those pagenumber references Message-ID: <1aa54.6e0339c6.38d926ec@aol.com> al said: > What bowerbird failed (or didn't bother)? > to mention was that > using curly braces for page numbers > and square brackets for footnotes \ > are practices that are documented in > PG's Volunteers' FAQ (V.98, V.99, V.103).? > As such,?my "personal practice"?is not > an invention of my own, but are?PG-standard, > documented, practices that I've adopted for my projects. discussions here are often so pointless it's not worth bothering. and yet i persist. sometimes i think _i_ must be the stupid one. but then, no, i realize, no, it's not me that's the stupid one at all. *** so... what al failed (or didn't bother) to mention is that whether or not any contributor _follows_ the f.a.q. is entirely a personal matter up to them... and that's why the f.a.q. don't really matter much... not unless the whitewashers enforce a particular aspect. and this one has not been enforced. so the vast majority of the .txt files have no pagenumbers. (well, actually, it _is_ enforced. because v.98 actually instructs producers that they should _not_ keep pagenumbers, except in "exceptional" cases. al tried to slip a fast one by us there, eh?) so p.g. has failed to create a convention about how it is done, even inside of its own cyberlibrary, let alone _outside_ of itself. and let me tell you that i respect michael hart's _principled_ decision not to enforce a standard much more than i respect a naive belief that -- just because it's in the f.a.q. -- you have established a convention. i don't respect that naivety at all... on the other hand, michael's unwillingness to take a stand _has_ meant that the producers have overruled the f.a.q. d.p. postprocessors have taken to including pagenumber info in their .html versions over the course of the last few years... many now include the pagenumbers as a matter of _routine._ that's the good news. the bad news is that the laissez faire attitude is paramount in d.p. postprocessors. they do things however they want. and they change how they do things whenever they want to. so, over the course of those last few years, they've treated pagenumber info in countless ways, with zero consistency. so it will be difficult or impossible to construct a "standard" from the d.p. practices, especially since the information is buried in the source .html, and not evident on the surface. it's also the case that there continue to be major problems with _all_ of their implementations, for reasons that might well be unavoidable, such as browsers that do not support the kind of functionalities that might be necessary to walk that tightrope i talked about between "pro" and "anti" forces. but, for people who like to view the glass as being 2/10 full instead of 8/10 empty, please enjoy the fact that the people who finish off the e-texts at d.p. now value pagenumbers... yet al still remains clueless... and his cluelessness moves up to a higher level as well. because remember that the _reason_ we want a convention is so that the developers of viewer-programs will support it, by programming the necessary capabilities into their apps... does anyone know any app developers who have done that? i mean, besides _me_ with _my_ apps? yeah, i thought not... the convention, even if obtained, is just a means to an end. and the pointlessness continues... one of the useful aspects of pagenumbers, as don points out, is they allow us to refer back to the page-scans of the book... but the f.a.q. betrays no knowledge of this beneficial purpose, and thus fails to enlighten the e-text producers of this linkage. if it _was_ based on this broad goal, the f.a.q. would also show awareness that pagenumbers per se are but a small part of the overall needs, along with things like _the_original_linebreaks_ and _the_original_end-line-hyphenates_. without those other vital aspects, it's similar to "baking a cake" with sugar as your only ingredient; the thing you get out at the end won't be cake. i talk more about this in the reply i drafted over the weekend, which i still intend to send today, so i won't belabor it now... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Mar 22 14:55:11 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 22 Mar 2010 17:55:11 EDT Subject: [gutvol-d] Re: save those pagenumber references Message-ID: <23a5c.75e8cc64.38d9413f@aol.com> al said: > Basic format: > > The prefix for the cover pages is: "c". > The prefix for the roman pages is: "f". > The prefix for the arabic pages is: "p". > > *** > > For blank pages there should be no file and > the page number should be skipped. > Optionally an image saying: > "This page is blank in the original." > may be inserted. > > *** > > Example of file naming: > > front cover c0001.png > back cover c0002.png > spine c0003.png > > i title page f0001.png > ii title verso f0002.png > iii dedication f0003.png > iv is blank > v contents f0005.png > > page 1 p0001.png > page 2 p0002.png > image on page 2 p0002-image1.png > image on page 2 p0002-image2.png > page 3 p0003.png > page 4 is blank > page 5 p0005.png > ... ... > page 9999 p9999.png dkretz said: > So far this "spec" seems to be primarily a legend. > Is it documented anywhere? al said: > No. It was developed and used by Joshua Hutchinson dkretz said: > Here's an extensive forum thread on DP > where we hashed this all out. oh lord. *** where do i begin? seriously, this is such a mess. where do i begin? *** well, to start with the last comment first, this wasn't "hashed out" at all. it was just messed up, because josh and marcello are too stubborn to take good advice from me. and on a more general level, this all shows that d.p. and p.g. can mess things up even when they actually try to do the right thing. *** so let's go back and examine the problems... *** we'll need to start with a short history lesson. years back, there was a push to get the scans hosted at p.g. with the text, and p.g. said ok. but when people started posting their scans, i noticed they had been named very stupidly. most stupid was that the filenames contained _numbers_ that were _not_ the _pagenumbers_. thus the file for page 123 might be "0128.png". this didn't surprise me, because d.p. has been naming their scans stupidly for many years... i'd tried to wise them up, but they didn't listen. but it's one thing to name _your_ files stupidly, since you're the only one who works with 'em, so you're the only one who pays the penalties of the big costs that stupid filenames impose. it is quite _another_ thing to name files that you post in public using a stupid convention, because the _public_ works with those files... luckily, the most insane position did not prevail. p.g. required that all scans must be named using the same number as the pagenumber. for a while, anyway, some d.p. people would rename all the scan-files so they could then be posted to p.g. yes, it's stupid to work with stupidly-named files, because you pay all the penalties of working with stupidly-named files, only to rename them to smarter names _after_ you're done working with them, but that's what d.p. was doing. for a little while. until it fizzed. the good news is that most scans at p.g. are named with a number that's the pagenumber. the bad news is that the renaming requirement essentially means not many scans get posted... the ugly news is that the names are _still_not_ really intelligent. they're not _moronic_, but they're not very intelligent either, not at all... on an i.q. scale, they'd weigh in at about 87. thus ends our history lesson to set context... *** ok, what comes next? first of all, let's remember the philosophy that should be a fundamental cornerstone of _any_ intelligent filenaming convention... one important principle (the first?) which should be at work here is that every filename is _unique._ that is, _each_and_every_ file should have a name that identifies _that_file_ separate from all others. now, there might be some cases where the same file might have different names in different places. (some would argue that; let's put that off for now.) but an _iron-clad_rule,_ with _no_ exception, is different files must always have different names. to say it another way, different files must _never_ have the same name. _never_, _never_, _never_. so right at the _very_outset_, the dp/pg model has failed us... all of their files are named with the same p0001.png-p9999.png convention and thus fail to meet the imperative to be _unique._ how can we tell one file named p0001.png from _every_other_ file named p0001.png? we cannot. and since every book has a p0001.png file, _bad_. this isn't rocket-science. it's common sense. _different_files_should_have_different names!_ we're back in the same old boat where we need to pay heed to the subdirectory name to know with certainty which book each file represents. if the filenames were unique, we could place every one of our files in a single subdirectory, and we would have no filename crashes and we could identify each file as a unique entity, just from its name, without looking inside it. i mean, it's great that we know that p0001.png is a scan of a page that was numbered as page 1 in the book in which it appeared, but the filename doesn't tell us _which_ book that was, so we are left out in the cold on the very first step we take. how sad... how utterly and thoroughly pathetic... *** to make my filenames _unique_ to a particular book, i give each scan in a book a 5-letter unique prefix... so, for the "sitka" book we've been analyzing lately, the 5-letter prefix for all the filenames is "sitka"... in case you're wondering, a 5-letter prefix gives us 26**5 possibilities for unique ones, which computes to 11 million possibilities. 11.8 million, to be exact, but some of those might be voided as unusable... if you feel a need to be able to label more books, a 6-letter prefix gives 308,915,776. (308+ million.) a 7-letter prefix gives 8 billion. 8-letter, 208 billion. let me know when you've got 208 billion documents. til then, an 8-letter prefix will work just fine, thanks. indeed, i'm happy with a 5-letter prefix at the moment. *** ok, so let's go on... jim said: > The prefix for the cover pages is: "c". > The prefix for the roman pages is: "f". > The prefix for the arabic pages is: "p". the "c", "f", and "p" convention is one i created... thankfully, this model was adopted by dp/pg. but there was a _reason_ i picked those letters, a good reason, and -- when it came to details -- dp/pg again screwed up with its implementation. the "p" stands for "page", and that's obvious. and "c" for "cover" is the obvious choice too. but some people suggested the front-matter should have an "r" prefix, for "roman numbers". know why i rejected "r" in favor of "f", do you? think about it for a minute, and see if you know. if you said i chose "f" to stand for "front-matter" or "forward-matter", you got an "f" on this quiz. it's a nice mnemonic, sure, but the real reason why i chose "f" is a much more pragmatic one... (know any other words that start with "mne" besides "mnemonic"? so what is its origin?) so, did you think of the answer why i used "f"? to explain why, think back to when i said that -- in coding your app and getting a "map" of the files within any specific book by reading the directory to see what files were there -- a vital component of that strategy will be that the filenames _sort_in_the_order_they_appear._ that is, we need to know not just the files that comprise the book, but their appearance order. so i choose "f" for front-matter pages because those pages appear between "c" and "p" pages -- the cover and the arabic-numbered pages -- so the prefix needed to fall between "c" and "p". and "f" worked just fine. you should also keep in mind that the letters "d" and "e" can be used between "c" and "f", if the idiosyncrasies of a certain book need it. likewise, there are lots of letters that can be used between "f" and "p", if a book needs 'em. and similarly, there are lots of letters _after_ "p" that can be used, for material that might come _after_ regular arabic-sequence "pages". but yeah, that's why i chose "f" instead of "r"... it was so the filenames would _sort_ correctly. *** and speaking along these lines, it's just plain silly that dp/pg pads their pagenumbers to 4 places... the vast majority of books are under 1000 pages, so padding the pagenumber to 3 places works well. that fourth padding place just causes more work. in those rare cases where you have pagenumbers that run in 4 digits, one can summon the "r" prefix to signify those pages, so "r000.png" is page 1000, "r001.png" would be 1001, "r002.png" 1002, etc. (yes, you could use "q" too. but as a general rule, you will leave yourself more flexibility if you do not choose to use prefixes that are directly adjoining.) *** the insanity continues... al says this: > For blank pages there should be no file > and the page number should be skipped. that's just crazy talk. include a blank image-file and name it appropriately, so the world doesn't suspect that you screwed up and dropped a file. because that's _exactly_ what they will suspect... (and with good reason. skipped pages happen, a lot, as the world learned from google's work.) *** ...and it goes on and on... al said: > front cover c0001.png > back cover c0002.png > spine c0003.png um, no. bad idea. very bad idea. you know how i said that the sort-order of the filenames should be identical to their order of appearance, right? so hopefully you understand that the back-cover -- i.e., the last thing in the book -- should have a filename that sorts to the end. not position #2. that's assuming that you even need a back-cover. and the spine? i suppose if you _must_ have it, you will be determined to include it, but please give it a name that sorts it to the end, too, since for most people it will just be a cute little gesture. consider it as the mint as you leave the restaurant. you might also remember that i insisted the files must reflect the recto/verso aspects of the book. for every recto file and filename, there _must_ be a verso file and filename. once again, if you fail to maintain this nicety, the world will suspect that you have lost a file, or that you just do not understand one of the basic structural aspects of the p-book, specifically that every piece of paper has two sides. that's why you always include a blank-page file... ...and why, if you have a file named "c0003.png", you must also have "c0004.png". don't forget it. *** ...and on and on... al said: > page 1 p0001.png > page 2 p0002.png > image on page 2 p0002-image1.png > image on page 2 p0002-image2.png > page 3 p0003.png > page 4 is blank > page 5 p0005.png > ... ... > page 9999 p9999.png first off, you can tell this originated from me, because of the all-lower-case look of it, _but_ i've always padded my numbers to just 3 digits. i believe it was marcello who added that 4th one. (and, as i just explained above, it's unnecessary.) and gee. you know, like jim said, what i propose is really -- at the very heart of it -- a simple system... so it's honestly quite _amazing_ that dp/pg could screw it up in so many different ways. _amazing_. look at the lines there pointing to "image on page 2". either marcello or josh must have added those too. this is something of a nightmare happening here. up to now, the files we've been talking about are _page-scans_. that is, they represent a full page. we all know why that's the case; it's because we are doing _proofing_, so we need the page-scan. now all of a sudden something different pops in, namely "images" contained on the same page as the page-scan (which, of course, is also an image). ok, i won't pretend i don't know what these are. they're higher-resolution versions of _pictures_ that were contained on that page in the p-book. which is all well and good, but let's not mix them in with the page-scans, which is what happens if you name the hi-res files using the same model. give those files names that are _quite_different_, and which sort them completely out of our range. it'd be good if you even stored them in a separate directory. (luckily, this is exactly what p.g. does, storing them in a subdirectory of the .html file, as these "subimages" are used by .html versions; but we certainly don't need 'em to do proofing.) better yet, examine if you need those files at all. if a particular page had a picture on it that needs to be scanned at a higher-resolution, then make the actual page-scan at that higher-resolution... there's no sense having a low-res version of it, especially if it's just going to cause us problems. then, in your e-book file, give instructions for the viewer-program about the coordinates of the scan that represent the picture that you want it to "clip". the viewer-app will then load in the high-res scan, clip out the picture, and then display it accordingly. (ok, this is a little futuristic, since no viewer-apps will do this currently, not even mine. but soon...) *** al said: > page 2 p0002.png > image on page 2 p0002-image1.png > image on page 2 p0002-image2.png one more thing about this. even though, as i mentioned above, these "subimage" filenames have no ill effects, as they're stored elsewhere, there is yet another problem presented here, one which _does_ manifest in the posted scans. you might get the idea, from that list there, that dashes are an ok thing in your filenames. the problem comes with unnumbered pages. let's say we have an unnumbered illustration facing page 36 in our "sitka" book, as we do. so our names would run like this: > sitkap035.png > sitkap036.png > sitkap036a.png > sitkap036b.png > sitkap037.png at least that's how _i_ do it... but if you looked at the policy as al wrote it, you might well conclude the names should be: > sitkap035.png > sitkap036.png > sitkap036-a.png > sitkap036-b.png > sitkap037.png or maybe you'd even think they could be: > sitkap035.png > sitkap036.png > sitkap036-1.png > sitkap036-2.png > sitkap037.png either way, the problem becomes clear if you once again recall that we want the filenames to _sort_ correctly... al's names will sort this way: > sitkap035.png > sitkap036-a.png > sitkap036-b.png > sitkap036.png > sitkap037.png this would cause the viewer-program to believe that it should place that unnumbered illustration between pages 35 and 36 -- a recto and a verso! this illustration either goes between 34 and 35, or it goes between 36 and 37, but that is unclear, and computer programs need things to be clear. *** if you are now asking "why do we need to be concerned with how computer programs will interpret these files?", then you're making the same mistake that the dp/pg people have made. you are failing to grasp the _larger_context_ in which these files will be used. and it is this larger context that is necessary to help us hone the conventions that we adopt in making e-texts. the pagenumber f.a.q. failed to consider the necessary linkage with the names of the scans, and the scanfile-naming rules failed to consider how those scans would be used by developers. this inability -- and unwillingness sometimes -- to see the big picture is why dp/pg isn't creating coherent policies on such matters, even when it actually _tries_ to do so (which is relatively rare). so there implementations will be short-sighted. when you add in the stubborn way that people like al and juliet and marcello and josh _refuse_ to take any advice from me, no matter how good, the situation can look bleak. however, i remain focused on the long-term, where i am confident -- supremely confident -- that my ideas will win. and in the short-term, i just remind myself, on the infrequent occasions when the question will present itself to me, that i am not the stupid one. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Mar 22 16:12:58 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 22 Mar 2010 19:12:58 EDT Subject: [gutvol-d] what do distributed book digitizers want? Message-ID: <243b.53d5c8bd.38d9537a@aol.com> over on his fadedpage forums, rfrank asks "what motivates users?" it's an excellent question, one that deserves a thoughtful answer... so here's my take on it, which i give partly because i have some disagreements with _some_ conclusions that roger has come to. i phrased these all in an affirmative way, so this could be used as a mission statement (e.g., for my own site, or by others), although the reverse-phrasing will often have made more sense to people... (for instance, "i do not want to be asked to do _unnecessary_ work.") i'd guess that most people would approve of most of my items, so i'd be interested to hear if anybody would challenge any of them... *** what i want as a distributed book digitizer... i want to proof, yes. i want to format too. i want to finish pages. i want to finish books. i want to smooth-read. i want to do a great job. i want to do necessary work. i want to select what i work on. i want to have unambiguous rules. i want to know if i am doing a great job. i want to know when i am making mistakes. i want to see solid proof if i've made a mistake. i want to receive fair credit for work i have done. i want to work in a system that's very transparent. i want to work with others who're doing great work. i want to know my energy is being used productively. i want to know how to improve the quality of my work. i want to know exactly what data the system has on me. i want to be able to challenge the system if acts unfairly. i want to let the world know when i have done a great job. i want to let the world know when i have done a lot of work. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Mon Mar 22 18:33:25 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 22 Mar 2010 21:33:25 EDT Subject: [gutvol-d] more sitka smoothreading glitches Message-ID: <4bcd0.498fe527.38d97465@aol.com> i see rfrank is making improvements to his smoothreading version of the "sitka" book. i wasn't sure if he would do that as they came in, or if he would simply wait and do all the fixes one time at the end... > http://www.fadedpage.com/s/sitka/sitka.htm an incremental approach is just fine, but it means that no one has yet reported this next glitch, which is a rather amazing one, since it has survived the system through preprocessing and proofing and postprocessing, although it doesn't even pass a simple spellcheck: > Some looked with extreme disfavor upon the establishment, > while others wrere friendly. it's also unclear whether anyone has reported the inconsistencies in the spelling of the baron's name -- is it wrangel or wrangell? -- but perhaps rfrank decided to leave 'em as they are in the p-book. of course, if _that_ were the case, he wouldn't have changed the two cases of the baron's name on page 43, since they are clearly printed as "wrangel". but also there, two alaskan places which -- as the book directly states there -- "today perpetuate his name" are clearly printed as "wrangell", which is the cause for confusion, compounded by the fact that the name is spelled as "wrangell" on pages 54, 61, 63 (twice), and 102, but as "wrangel" on page 75... aside from the inconsistent-with-the-printed-page instances on page 43, rfrank was also inconsistent with the ink-on-paper on page 63 (the second instance), where he was not just inconsistent with the printed book, but with his own version on the same page. (in other words, the page was consistent itself, but rfrank was not.) *** all of this is not to criticize rfrank. indeed, i will tell you that he is an excellent postprocessor. he has a ton of experience; he's probably submitted over 500 books to p.g. by this time... what this _does_ show is that even an excellent postprocessor, with a ton of experience, can have errors that persist through preprocessing, proofing, and postprocessing, and maybe even through smoothreading. (at least this far, these glitches have.) so i think this is good evidence that "once and done" is _not_ a good strategy for a roundless system. that philosophy has _never_ been a part of the roundless system that _i_ preach... indeed, i believe any change should be reviewed and approved by two separate people before it is considered to be "golden"... it's also important to remind ourselves that we are not "short" of proofers. to the contrary, we have a huge _glut_ of proofers. distributed proofreaders has so many proofers that they are now actively considering ways to _throttle_ their p1 proofers! with an _abundance_ of proofers, there is no need to scrimp... we can have multiple proofers look at every page in every book. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Mon Mar 22 18:41:09 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Tue, 23 Mar 2010 02:41:09 +0100 Subject: [gutvol-d] Re: save those pagenumber references In-Reply-To: <23a5c.75e8cc64.38d9413f@aol.com> References: <23a5c.75e8cc64.38d9413f@aol.com> Message-ID: BB I can not believe you are serious. 1) Your critic fails all logic. Why in Gods name would anybody intermix scans from more than one book in the same directory. Their are more than enough files just from one book ! 2) How is a sequence of five arbitary characters anymore informative. Or can you remeber 26^5 titles. Come On Man! Wake up. regards Keith. Am 22.03.2010 um 22:55 schrieb Bowerbird at aol.com: [snip, snip hot air deleted] From Bowerbird at aol.com Tue Mar 23 00:26:05 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 23 Mar 2010 03:26:05 EDT Subject: [gutvol-d] Re: save those pagenumber references Message-ID: <705ad.3f20837b.38d9c70d@aol.com> keith said: > BB I can not believe you are serious. is that so? because i find your disbelief to be quite humorous! :+) > 1) Your critic fails all logic. it fails _all_ logic? i have a hard time believing that, keith... :+) > Why in Gods name would anybody intermix scans > from more than one book in the same directory. > Their are more than enough files just from one book ! i wrote that huge post, and _that's_ what you took from it? talk about missing the point. you missed it by a mile, keith. (a mile is about 1.6 kilometers, in case you are wondering.) for the record, not that i think anyone else missed the point, it might not be that you'd _want_ to put more than one book in a directory, it's that you _could_ if you ever _did_ want to, whereas, when all books are named p001-p999, you cannot. the more important point is that, given the files for a book, and for another book, you wanna be able to tell them apart. all files for a book should be named with a common element. and the name of every file should be unique from all others, across your entire system. this is nothing but common sense. > 2) How is a sequence of five arbitary characters anymore > informative. Or can you remeber 26^5 titles. the characters are not informative in and of themselves, but they become meaningful when all files from a book receive the same prefix, because then you see, just from the names, they go together. and no, there's no need to remember them, since the catalog will keep all of the information straight and make the appropriate information available to the end-users. but i suspect you knew all that. *** but in order to see how someone might do it another way, go look at the internet archive and their naming convention. they went for longer names, hoping for _some_ meaning... and, to a degree, they attained it, at a cost in convenience. for instance, here's a subdirectory name: > http://www.archive.org/details/adventuresoftoms00twaiiala that subdirectory maps onto another more-specific one: > http://ia331317.us.archive.org/1/items/adventuresoftoms00twaiiala/ so their "name" for this book is "adventuresoftoms00twaiiala". therefore, you might guess -- correctly -- that this book is "the adventures of tom sawyer". but it doesn't inform you _which_ edition of the book this is, or where it came from, or if it is one of the several copies from project gutenberg, or when it was published, or any number of details about it. to get to that information, you'll have to visit their catalog, and if you're gonna visit a catalog anyway, you might as well visit the catalog to find out the 5-letter "prefix" of the book, a prefix that's much easier than "adventuresoftoms00twaiiala". and you better believe me, because it has happened to me, once you get a lot of the archive.org files on your machine, it starts to become very hard to discriminate names such as: > http://www.archive.org/details/adventuresoftoms00twaiiala > http://www.archive.org/details/theadventuresoft00074gut > http://www.archive.org/details/theadventuresoft07193gut > http://www.archive.org/details/theadventuresoft07194gut > http://www.archive.org/details/adventurestomsa02twaigoog > http://www.archive.org/details/adventurestomsa00twaigoog > http://www.archive.org/details/adventurestomsa00willgoog > http://www.archive.org/details/adventurestomsa01twaigoog > http://www.archive.org/details/adventurestomsa05twaigoog > http://www.archive.org/details/tomsawyer00twain > http://www.archive.org/details/adventuresoftoms20twai > http://www.archive.org/details/adventuresoftoms99twai > http://www.archive.org/details/adventuresoftoms00twai2 > http://www.archive.org/details/tomsawyeradv00twairich > http://www.archive.org/details/advtomsawyer00twairich > http://www.archive.org/details/booki-export-the-adventures-of-tom-sawyer so, for me anyway, a 5-letter prefix seems to do the job just fine. *** likewise, we can look at the system used by project gutenberg, where the "prefix" for the book is essentially its 5-digit name. digits are, in some ways, even more convenient that characters. the problem is, 5-digit names only work up to 99,999 books... that's enough for now, for project gutenberg, so that's fine, but i wanted more breathing room, so i chose 5-character names... *** or let's take a look at youtube names. here's a sample u.r.l.: > http://www.youtube.com/watch?v=sA_0cvd1EUM > http://www.youtube.com/watch?v=qybUFnY7Y8w first, i'm not sure why they need that "watch" in every u.r.l. surely "watching" a video would be the default action, not?, so it seems to me they could have abstracted that out, but... we find they're using an 11=character name, one that uses _both_ uppercase and lowercase letters (i only use lowercase), _and_ numbers, _and_ at least some other characters as well. that's going to give them _many_trillions_ of possible names, which i guess is how high you think if you sell for $1.6billion. *** speaking of google, let's see their book filename convention: > http://www.google.com/books?id=3n4hAAAAMAAJ > http://www.google.com/books?id=Y7sOAAAAIAAJ they've got a 12-character name, uppercase and lowercase, plus numbers. which, again, will accommodate lots of files. *** > Come On Man! Wake up. well, it's after midnight my time, so i'm about to go to sleep; but i will wake up tomorrow morning, all ready to post again. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Tue Mar 23 03:10:46 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Tue, 23 Mar 2010 11:10:46 +0100 Subject: [gutvol-d] Re: save those pagenumber references In-Reply-To: <705ad.3f20837b.38d9c70d@aol.com> References: <705ad.3f20837b.38d9c70d@aol.com> Message-ID: <8D75E262-82B9-447A-9C3A-6F124822F56F@uni-trier.de> Hi BB, I did get your point and they due have there merit, yet no more than any other filenaming convention where you overly compress the names. I will not either go into how flawed they are. If you want telling filenames use them. We are not living in a DOS world where we are limited to 8 characters. Proog given that your naming convention is flawed and so now you can change it !! regards Keith. Am 23.03.2010 um 08:26 schrieb Bowerbird at aol.com: > keith said: > > BB I can not believe you are serious. > > is that so? because i find your disbelief to be quite humorous! :+) > > > > 1) Your critic fails all logic. > > it fails _all_ logic? i have a hard time believing that, keith... :+) > > > > Why in Gods name would anybody intermix scans > > from more than one book in the same directory. > > Their are more than enough files just from one book ! > > i wrote that huge post, and _that's_ what you took from it? > > talk about missing the point. you missed it by a mile, keith. > (a mile is about 1.6 kilometers, in case you are wondering.) > > for the record, not that i think anyone else missed the point, > it might not be that you'd _want_ to put more than one book > in a directory, it's that you _could_ if you ever _did_ want to, > whereas, when all books are named p001-p999, you cannot. > > the more important point is that, given the files for a book, > and for another book, you wanna be able to tell them apart. > all files for a book should be named with a common element. > and the name of every file should be unique from all others, > across your entire system. this is nothing but common sense. > > > > 2) How is a sequence of five arbitary characters anymore > > informative. Or can you remeber 26^5 titles. > > the characters are not informative in and of themselves, but > they become meaningful when all files from a book receive > the same prefix, because then you see, just from the names, > they go together. and no, there's no need to remember them, > since the catalog will keep all of the information straight and > make the appropriate information available to the end-users. > > but i suspect you knew all that. > > *** > > but in order to see how someone might do it another way, > go look at the internet archive and their naming convention. > > they went for longer names, hoping for _some_ meaning... > and, to a degree, they attained it, at a cost in convenience. > > for instance, here's a subdirectory name: > > http://www.archive.org/details/adventuresoftoms00twaiiala > > that subdirectory maps onto another more-specific one: > > http://ia331317.us.archive.org/1/items/adventuresoftoms00twaiiala/ > > so their "name" for this book is "adventuresoftoms00twaiiala". > > therefore, you might guess -- correctly -- that this book is > "the adventures of tom sawyer". but it doesn't inform you > _which_ edition of the book this is, or where it came from, > or if it is one of the several copies from project gutenberg, > or when it was published, or any number of details about it. > to get to that information, you'll have to visit their catalog, > and if you're gonna visit a catalog anyway, you might as well > visit the catalog to find out the 5-letter "prefix" of the book, > a prefix that's much easier than "adventuresoftoms00twaiiala". > > and you better believe me, because it has happened to me, > once you get a lot of the archive.org files on your machine, > it starts to become very hard to discriminate names such as: > > http://www.archive.org/details/adventuresoftoms00twaiiala > > http://www.archive.org/details/theadventuresoft00074gut > > http://www.archive.org/details/theadventuresoft07193gut > > http://www.archive.org/details/theadventuresoft07194gut > > http://www.archive.org/details/adventurestomsa02twaigoog > > http://www.archive.org/details/adventurestomsa00twaigoog > > http://www.archive.org/details/adventurestomsa00willgoog > > http://www.archive.org/details/adventurestomsa01twaigoog > > http://www.archive.org/details/adventurestomsa05twaigoog > > http://www.archive.org/details/tomsawyer00twain > > http://www.archive.org/details/adventuresoftoms20twai > > http://www.archive.org/details/adventuresoftoms99twai > > http://www.archive.org/details/adventuresoftoms00twai2 > > http://www.archive.org/details/tomsawyeradv00twairich > > http://www.archive.org/details/advtomsawyer00twairich > > http://www.archive.org/details/booki-export-the-adventures-of-tom-sawyer > > so, for me anyway, a 5-letter prefix seems to do the job just fine. > > *** > > likewise, we can look at the system used by project gutenberg, > where the "prefix" for the book is essentially its 5-digit name. > > digits are, in some ways, even more convenient that characters. > the problem is, 5-digit names only work up to 99,999 books... > that's enough for now, for project gutenberg, so that's fine, but > i wanted more breathing room, so i chose 5-character names... > > *** > > or let's take a look at youtube names. here's a sample u.r.l.: > > http://www.youtube.com/watch?v=sA_0cvd1EUM > > http://www.youtube.com/watch?v=qybUFnY7Y8w > > first, i'm not sure why they need that "watch" in every u.r.l. > surely "watching" a video would be the default action, not?, > so it seems to me they could have abstracted that out, but... > > we find they're using an 11=character name, one that uses > _both_ uppercase and lowercase letters (i only use lowercase), > _and_ numbers, _and_ at least some other characters as well. > that's going to give them _many_trillions_ of possible names, > which i guess is how high you think if you sell for $1.6billion. > > *** > > speaking of google, let's see their book filename convention: > > http://www.google.com/books?id=3n4hAAAAMAAJ > > http://www.google.com/books?id=Y7sOAAAAIAAJ > > they've got a 12-character name, uppercase and lowercase, > plus numbers. which, again, will accommodate lots of files. > > *** > > > Come On Man! Wake up. > > well, it's after midnight my time, so i'm about to go to sleep; > but i will wake up tomorrow morning, all ready to post again. > > -bowerbird > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Tue Mar 23 12:09:47 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 23 Mar 2010 15:09:47 EDT Subject: [gutvol-d] Re: save those pagenumber references Message-ID: <5bee0.5e4a912e.38da6bfb@aol.com> keith said: > I did get your point and they due have there merit, > yet no more than any other filenaming convention > where you overly compress the names. what do you mean by "overly compress the names"? what would a "noncompressed" filename look like? > I will not either go into how flawed they are. ...because you have no arguments of substance... > If you want telling filenames use them. again, what does this _mean_? > We are not living in a DOS world > where we are limited to 8 characters. the history of the u.r.l. in terms of its length, is rather interesting. everybody started with an ethic that they should be short and punchy. not just for convenience, but memorability too. gradually the u.r.l. began accumulating length, as websites got more extensive and files were segmented into subdirectories for convenience. then google started giving juice for content words in the u.r.l., and the length zoomed ridiculously, as everyone employed long names for s.e.o. purposes. things got so ludicrous that we had the emergence of u.r.l. "shorteners", web services that promised to end the scourge of a long u.r.l. by providing a much shorter one they maintained which rerouted people to the longer original, _plus_ furnished some stats, so you knew where the clicks were coming from, etc. what happened then was that twitter hit, and hit big. all of a sudden, people faced a 140-character limit. they didn't want to "waste" a substantial percentage of that limit every time they wanted to send a u.r.l., so the demand for shortener services skyrocketed... so before we could turn around, there were dozens such services, and not just 2 or 3 (bit.ly and tinyurl), and things got messy. first, the shortened u.r.l. is a pain in the ass for many people, because tweeters will often provide different shortened versions for the same long u.r.l., but your browser doesn't show them as already-visited links (since technically they _are_ different links, and your browser doesn't know that they all point to the same eventual destination). second, shortener services make the u.r.l. "brittle"... if the shortener service breaks down, so does their "rerouting" ability which points to the ultimate site, causing all those links to break for no good reason. as startups, with very little chance of "making it", the original shorteners had frequent down-time, so the problem was readily apparent, even then... but as more and more of these services started up -- hoping to hit the lottery by being "blessed" by twitter or google or anyone who would buy them for a boatload of money -- it was more and more clear that most of these services _would_fail_, and take all their short links with them when they did. and sure enough, then they did start closing down. and they continue to have cutbacks, to this very day. one of them -- http://tr.im/ -- just announced that it is no longer accepting u.r.l. shortening requests... luckily, they're still honoring their current redirects; but what happens when they go completely under? well, we're lucky once again, because google has come to the rescue. they have ensured that they will support a service designed to honor redirects for any shortener service that goes out of business. it makes sense, since they have a large degree of responsibility for this problem in the first place, since they give extra google juice to a long u.r.l. thankfully, though, the shortener services made us admit to ourselves that the long u.r.l. is a problem, bringing us to the current stage of u.r.l. history, where we are once again embracing the short u.r.l. many people are now voluntarily cutting back on the use of the long u.r.l.; google could help this effort by reversing its policy to give juice to the long u.r.l. because a short and clear u.r.l. is a better u.r.l. because people _do_ have to occasionally type in a u.r.l., and can't just do a simple copy-and-paste. because people often include u.r.l. in listserve posts, where there is an imposed length on the lines, and u.r.l. get printed in p-books, with limited line-length. because people tweet u.r.l. because people dislike the brittle shortened u.r.l. so that's why i think my 5-letter prefix works just fine. > Proog given that your naming convention is flawed > and so now you can change it !! huh? what? i guess you better run that by me again. no, on second thought, never mind. this is a great example about here discussion here is one big waste. i don't think you're _trying_ to sidetrack the dialog, keith, so i'm not going to scold you, but just tell you that you need to keep things moving _forward_, ok? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From schultzk at uni-trier.de Tue Mar 23 14:56:17 2010 From: schultzk at uni-trier.de (Keith J. Schultz) Date: Tue, 23 Mar 2010 22:56:17 +0100 Subject: [gutvol-d] Re: save those pagenumber references In-Reply-To: <5bee0.5e4a912e.38da6bfb@aol.com> References: <5bee0.5e4a912e.38da6bfb@aol.com> Message-ID: Hi BB, I can not help you if you do not understand plain english or american for that matter. Also, since you are unable to stay with any point I say good-bye. regards Keith. Am 23.03.2010 um 20:09 schrieb Bowerbird at aol.com: > keith said: > > I did get your point and they due have there merit, > > yet no more than any other filenaming convention > > where you overly compress the names. > > what do you mean by "overly compress the names"? > what would a "noncompressed" filename look like? > > > > I will not either go into how flawed they are. > > ...because you have no arguments of substance... > > > > If you want telling filenames use them. > > again, what does this _mean_? > > > > We are not living in a DOS world > > where we are limited to 8 characters. > > the history of the u.r.l. in terms of its length, > is rather interesting. everybody started with > an ethic that they should be short and punchy. > not just for convenience, but memorability too. > > gradually the u.r.l. began accumulating length, > as websites got more extensive and files were > segmented into subdirectories for convenience. > > then google started giving juice for content words > in the u.r.l., and the length zoomed ridiculously, as > everyone employed long names for s.e.o. purposes. > > things got so ludicrous that we had the emergence > of u.r.l. "shorteners", web services that promised to > end the scourge of a long u.r.l. by providing a much > shorter one they maintained which rerouted people > to the longer original, _plus_ furnished some stats, > so you knew where the clicks were coming from, etc. > > what happened then was that twitter hit, and hit big. > > all of a sudden, people faced a 140-character limit. > they didn't want to "waste" a substantial percentage > of that limit every time they wanted to send a u.r.l., > so the demand for shortener services skyrocketed... > > so before we could turn around, there were dozens > such services, and not just 2 or 3 (bit.ly and tinyurl), > and things got messy. first, the shortened u.r.l. is > a pain in the ass for many people, because tweeters > will often provide different shortened versions for > the same long u.r.l., but your browser doesn't show > them as already-visited links (since technically they > _are_ different links, and your browser doesn't know > that they all point to the same eventual destination). > > second, shortener services make the u.r.l. "brittle"... > if the shortener service breaks down, so does their > "rerouting" ability which points to the ultimate site, > causing all those links to break for no good reason. > > as startups, with very little chance of "making it", > the original shorteners had frequent down-time, > so the problem was readily apparent, even then... > > but as more and more of these services started up > -- hoping to hit the lottery by being "blessed" by > twitter or google or anyone who would buy them > for a boatload of money -- it was more and more > clear that most of these services _would_fail_, and > take all their short links with them when they did. > > and sure enough, then they did start closing down. > > and they continue to have cutbacks, to this very day. > one of them -- http://tr.im/ -- just announced that > it is no longer accepting u.r.l. shortening requests... > luckily, they're still honoring their current redirects; > but what happens when they go completely under? > > well, we're lucky once again, because google has > come to the rescue. they have ensured that they > will support a service designed to honor redirects > for any shortener service that goes out of business. > > it makes sense, since they have a large degree of > responsibility for this problem in the first place, > since they give extra google juice to a long u.r.l. > > thankfully, though, the shortener services made us > admit to ourselves that the long u.r.l. is a problem, > bringing us to the current stage of u.r.l. history, > where we are once again embracing the short u.r.l. > > many people are now voluntarily cutting back on the > use of the long u.r.l.; google could help this effort > by reversing its policy to give juice to the long u.r.l. > > because a short and clear u.r.l. is a better u.r.l. > > because people _do_ have to occasionally type in > a u.r.l., and can't just do a simple copy-and-paste. > > because people often include u.r.l. in listserve posts, > where there is an imposed length on the lines, and > u.r.l. get printed in p-books, with limited line-length. > > because people tweet u.r.l. > > because people dislike the brittle shortened u.r.l. > > so that's why i think my 5-letter prefix works just fine. > > > > Proog given that your naming convention is flawed > > and so now you can change it !! > > huh? what? i guess you better run that by me again. > > no, on second thought, never mind. this is a great > example about here discussion here is one big waste. > > i don't think you're _trying_ to sidetrack the dialog, > keith, so i'm not going to scold you, but just tell you > that you need to keep things moving _forward_, ok? > > -bowerbird > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/mailman/listinfo/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Tue Mar 23 15:02:27 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 23 Mar 2010 18:02:27 EDT Subject: [gutvol-d] Re: save those pagenumber references Message-ID: <63e12.3dbb24c5.38da9473@aol.com> well, i guess i'll never know what an "uncompressed" filename would look like, or what a "telling filename" could possibly be... my loss, apparently... :+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowerbird at aol.com Tue Mar 23 17:14:57 2010 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 23 Mar 2010 20:14:57 EDT Subject: [gutvol-d] Re: his style grates on many and his egotism seems boundless Message-ID: <6b857.3980f0bc.38dab381@aol.com> over on his fadedpage forums, rfrank said: > For those of you that read gutvol-d, > you should know that I do, also. ok. that clears that up, for anyone who was wondering. > Bowerbird seems very interested in the work > we are doing here you're darn tooting i am. :+) rfrank is currently doing the most interesting work in the whole arena of book distribution, and that's an arena where i have exhibited keen interest for a very long time, certainly long before rfrank ever became involved with it. > Bowerbird seems very interested in > the work we are doing here and posts his > observations and suggestions of the gutvol-d list. well, yeah, i've been posting on this listserve since 2003, except for that short time when the attack-pack got me "moderated", and i went on strike until it was rescinded. so again, nothing new there... > Though his style grates on many > and his egotism seems boundless, ok, let's break this down, shall we? > his style grates on many yes it does. and this is particularly so for those people who subscribe to the dale-carnegie school-of-thought on "how to win friends and influence people", of which i do firmly believe that rfrank is a very big follower... (and al haines too, as well as juliet sutherland.) so let me be perfectly clear on this matter once again. i _hate_ the dale-carnegie philosophy. with a passion. i consider it to be tremendously duplicitous, in that it zeros in on one of the most pathetic of human traits -- our insecurity about our worth -- and trades on it. it encourages one to feed people positive reinforcement, so as to make them feel good and overcome insecurity, so they will come to like you and be influenced by you... i'm not denying that it _works_. it works all too well! but it's cynical. and it's manipulative. and it's ugly. it tip-toes around the issue of flagrant dishonesty by informing its adherents to strive to be honest, and not to lie outright, but that's largely a cover-up which denies the fact one can't _always_ be positive, not if one feels any solid commitment to the truth, the whole truth, and nothing but the truth, as i do. so i will often go out of my way to do reverse-carnegie. dale says "never tell someone they are wrong". so when i am thoroughly convinced someone is wrong, when it's true, i say it, and with gusto: "you are wrong." and then i give all of the reasons _why_ they are wrong, which is also a reverse-carnegie, because dale says that you should always give people an out, a way to save face. those are just two quick examples. but that's enough, because we're not really here to talk about dale carnegie. the thing to remember is, i don't give a crap if anyone here becomes my "friend" or not. i have enough friends, and i don't even _want_ friends who i can't be honest with. and i'm not here attempting to "influence" anyone either. a lot of people get confused about this, because i'm often saying that things _should_be_done_ in a certain way, so people think i have some kind of personal interest about actually _having_ them done that way. i don't really care. you can do it however you want. because when i talk about "how something should be done", i'm talking about _logic_. i'm talking about the _arguments_ that dictate that decision. as shown here, frequently, a lot of people here don't seem to care about "logic" and "reasoned decisions" and stuff like that. which is fine by me. please make decisions however you want. the thing is, it really pisses off the carnegie adherents when you don't care whether you influence them or not, probably because they are willing to sell their soul to have influence, so your apathy (or hostility) about it contradicts their values. so when you fail to butter them up before you lobby them, like dale advises, they get all offended, and even _mean_... (that's right, they forget dale's advice to always be nice, which just goes to show they didn't absorb it very deeply; they only use it because it often works on a surface level.) so, yeah, my style "grates" on some people. so what? because a whole lot of other people -- who i actually like a lot better -- actually _appreciate_ and _respect_ someone who is willing to speak their mind honestly... > and his egotism seems boundless that's just a silly projection. i'm a humble person. i am honestly and truly humble. i'm unimposing, and i'm tremendously kind and gentle. and it's not just a phony act i put on to "win friends"... but there is something about truth. when you have truth on your side, you become strong. you become invincible. i work -- hard! -- to make sure i get to the bottom of a situation, and consider every angle, because it is _vitally_ important to me that i have truth on my side. if i'm on one side of an issue, and the strength of the argumentation suddenly flips truth to the _other_ side, i flip right along with it. because truth is important... yes, one of my biggest flaws is saying "i told you so." but one of my biggest assets is that i have absolutely no reluctance, at all, to say "i was wrong" when i was. a lot of people think i'm "egotistical" when i'm _really_ just extremely confident that i have truth on my side... so it actually has nothing to do with _me_, or my _ego_. instead, it has _everything_ to do with _truth_... > Though his style grates on many > and his egotism seems boundless, > at times there is something of worth in what he posts. of course there is. that's because i have truth on my side. it's also because i'm enough of a scientist that i'm willing -- nay, _eager_ -- to listen when someone says they think that i'm wrong. because if they're correct that i am wrong, i _want_ them to show me the light, so i can switch sides... but again, i don't really care if i "convince" anyone or not. it's an intellectual exercise for me, not a power struggle... > at times there is something of worth in what he posts. oh yeah, and the _other_ thing is that you can never trust a carnegie follower when they say anything nice about you, because they're probably just attempting to butter you up. so maybe roger doesn't even _believe_ what he said there. > He mistakenly reported that the SR version of the book > is being incrementally updated. well, the file that is now posted on your site, to which i gave the u.r.l., is _not_ the file that was posted yesterday. the pagination error i pointed out yesterday was corrected. so i'm not sure how you can use the term "mistakenly"... > He also shows he hasn't come to a complete understanding > of the unusual situation in the text regarding the > inconsistent usage of Wrangel and Wrangell spellings > as it applies to Barons, islands and native population. > It still isn't right and will be bimodally normalized > after smoothreading completes. i didn't really try to "come to a complete understanding". the p-book appears to me to be inconsistent in its usage, and you appear to be inconsistent too, and your usage does not achieve consistency with the p-book's usage... i pointed out the inconsistencies to show i'd found them. but there's no payoff for me to do any more work on that. > He did, however, correctly spot the effect of a superfluous > page transition marker after the last illustration > on a numbered page in the book. Since these books are > all generated from one source file, it was a simple fix > and it was regenerated in a heartbeat. ok. so the file that's up online was _not_ "updated", but it _was_ "regenerated". i'll try to remember this terminology. > He also believes that I may have post-processed > over 500 books, and I have not. well, i'd rather give rfrank _more_ credit than _less_... i know he's done _hundreds_and_hundreds_ of books. he's also programmed a lot of tools, and is now running the roundless experimental site, plus he's on the board at d.p., so it's clear that he's doing a lot, and i give him credit for it. > Though I could have a lot to say about his posts, I choose > not to engage him for historical and practical reasons. the "historical" reason might be that when he did engage me, he tried to deny reality, so i rubbed his nose in it, just like you rub a dog's nose in his pee when he urinates in your house... and the "practical" reason might be that he knows i will do that again if he tries to deny reality again, carnegie notwithstanding. but hey, i don't need for us to "engage". i'm self-motivated. i will say what i have to say, whether anyone listens or not... so he can say what he wants on his board, and i'll post here. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: