From kionon at animemusicvideos.org Sat Mar 1 10:35:08 2008 From: kionon at animemusicvideos.org (Kionon) Date: Sun, 2 Mar 2008 03:35:08 +0900 Subject: [gutvol-d] The Old Fashioned Way... Message-ID: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> List, If I wish to add a public domain book to the project, and I actually desire to type it up by hand, is there any reason why I can't? Very respectfully, Kevin M. Callahan -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080302/8a5513c7/attachment.htm From klofstrom at gmail.com Sat Mar 1 10:56:47 2008 From: klofstrom at gmail.com (Karen Lofstrom) Date: Sat, 1 Mar 2008 08:56:47 -1000 Subject: [gutvol-d] The Old Fashioned Way... In-Reply-To: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> Message-ID: <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> On 3/1/08, Kionon wrote: > If I wish to add a public domain book to the project, and I actually desire > to type it up by hand, is there any reason why I can't? Unfortunately, you can. I say "unfortunately" because it is close to certain that you are going to produce a flawed text. Since Project Gutenberg, at present, doesn't have any quality controls, it will accept your flawed text. Why flawed? The more work I do at Distributed Proofreaders -- and in commercial publishing -- the clearer it is that it takes more than one pair of eyes to produce a good text. A second person will catch what the first person missed. No matter how good the first person is. That's why Distributed Proofreaders now subjects most (but not all) of the texts it produces to three rounds of human proofreading. Particularly easy projects may be done in two rounds. DP recently re-did a book that had been done in the early days of Project Gutenberg. 
The post-processor checked for differences between the early, typed-in text and the later DP effort. There were 44 differences, all of them errors in the earlier text.

If you find the type-in process involving and soothing, DP does do "type-in" projects. These are texts that cannot be OCRed, usually very old books in antique typefaces.

Instead of trying to do it on your own, come join the community at Distributed Proofreaders. We make better books and we (usually) have a lot of fun doing it. If you participate in the community forum, the bulletin board, you will meet lots of bright, bookish, and delightfully eccentric people.

-- Zora

From kionon at animemusicvideos.org Sat Mar 1 11:08:26 2008
From: kionon at animemusicvideos.org (Kionon)
Date: Sun, 2 Mar 2008 04:08:26 +0900
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com>
References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com>
	<1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com>
Message-ID: <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com>

> Unfortunately, you can. I say "unfortunately" because it is close to
> certain that you are going to produce a flawed text. Since Project
> Gutenberg, at present, doesn't have any quality controls, it will
> accept your flawed text.

Hrm.

> Why flawed? The more work I do at Distributed Proofreaders -- and in
> commercial publishing -- the clearer it is that it takes more than one
> pair of eyes to produce a good text. A second person will catch what
> the first person missed. No matter how good the first person is.
> That's why Distributed Proofreaders now subjects most (but not all) of
> the texts it produces to three rounds of human proofreading.
> Particularly easy projects may be done in two rounds.

Oh, I am most certainly aware of this. My background is in journalism, politics, and philosophy.
I always made it a habit to read documents backwards, but always had other eyes looking over the text as well.

> If you find the type-in process involving and soothing, DP does do
> "type-in" projects. These are texts that cannot be OCRed, usually very
> old books in antique typefaces.

Well, there certainly is an interest in the process in general. However, I also wanted to do works that were of personal importance to me. There is something far more magical, I would think, about a text that impacted you in such a way that you would wish to do a rather labor-intensive transcription. That, and of course, I lack a scanner, and am not located within a reasonable radius of a place where I could obtain one (I am, in fact, not even in an English-speaking country).

> Instead of trying to do it on your own, come join the community at
> Distributed Proofreaders. We make better books and we (usually) have a
> lot of fun doing it. If you participate in the community forum, the
> bulletin board, you will meet lots of bright, bookish, and
> delightfully eccentric people.

I had intended to do that, certainly, but again I point to the fact that I wish to work on projects that have had a particular impact on me; we are far more wont, it should not surprise you, to preserve that which we are fond of.

Very respectfully,
Kevin M. Callahan

From grythumn at gmail.com Sat Mar 1 11:13:36 2008
From: grythumn at gmail.com (Robert Cicconetti)
Date: Sat, 1 Mar 2008 14:13:36 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com> References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com> Message-ID: <15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com> On Sat, Mar 1, 2008 at 2:08 PM, Kionon wrote: > That, and of course, I lack a scanner, and am not located within a > reasonable radius of a location with which to obtain one (I am, in > fact, not even in an English speaking country). > You have to have access to a scanner (or find a scan online of the same edition) in order to use the copyright clearance process at copy.pglaf.org. I think there is an older process involving mailing in photocopies of the title/verso to MH, but I have never used it... R C -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080301/04c9c9c7/attachment.htm From kionon at animemusicvideos.org Sat Mar 1 11:15:19 2008 From: kionon at animemusicvideos.org (Kionon) Date: Sun, 2 Mar 2008 04:15:19 +0900 Subject: [gutvol-d] The Old Fashioned Way... In-Reply-To: <15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com> References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com> <15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com> Message-ID: <8893d7a30803011115v5d6f816evfa0a4f73d78f6931@mail.gmail.com> > You have to have access to a scanner (or find a scan online of the same > edition) in order to use the copyright clearance process at copy.pglaf.org. > I think there is an older process involving mailing in photocopies of the > title/verso to MH, but I have never used it... > That would seem to present a problem. Very respectfully, Kevin M. 
Callahan From grythumn at gmail.com Sat Mar 1 11:58:02 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Sat, 1 Mar 2008 14:58:02 -0500 Subject: [gutvol-d] The Old Fashioned Way... In-Reply-To: <8893d7a30803011115v5d6f816evfa0a4f73d78f6931@mail.gmail.com> References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com> <15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com> <8893d7a30803011115v5d6f816evfa0a4f73d78f6931@mail.gmail.com> Message-ID: <15cfa2a50803011158l2ff0b112i75af06dc3a7c7399@mail.gmail.com> On Sat, Mar 1, 2008 at 2:15 PM, Kionon wrote: > > You have to have access to a scanner (or find a scan online of the same > > edition) in order to use the copyright clearance process at > copy.pglaf.org. > > I think there is an older process involving mailing in photocopies of > the > > title/verso to MH, but I have never used it... > > > > That would seem to present a problem. > How about access to a digital camera? Even an old webcam might do, if you take pictures of several segments and stitch them together. We're only talking 2-4 pages. R C -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080301/bdf4c8ed/attachment.htm From ajhaines at shaw.ca Sat Mar 1 12:12:44 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sat, 01 Mar 2008 12:12:44 -0800 Subject: [gutvol-d] The Old Fashioned Way... References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> Message-ID: <001e01c87bd8$9c7e89b0$6501a8c0@ahainesp2400> As long as you've gotten a copyright clearance as per PG's How-To's, you can pretty much do what you want. BUT... If you do this book by hand, be prepared to proof it almost word by word. 
It's too easy for eyes to jump words, skip lines, and make all manner of human mistakes. Type a page, proof a page is a good rule. When you've done an entire chapter, put it aside for a few days so it's stale to your memory of it, then proof it again. Then, get someone else to do another proof. If you haven't already, check out the PG How-To and FAQ links at PG's main page http://www.gutenberg.org/wiki/Main_Page. I don't know what word processor you're using, but try to keep the lines the same length as they are in the original book. If possible, put a soft return at the end of each line (in MS-Word, that's done with Shift-Enter). (But use a hard return at paragraph end.) That way, you can do a line count of each typed page, and if that doesn't match the book, you've done something wrong. The soft returns can be dealt with when the chapter is complete. Save each chapter as a separate file, in two formats--your word processor's native format and as a standard text file. Run Gutcheck, Jeebies, and Gutspell on the text version, and fix problems in the native version. They're available at http://gutcheck.sourceforge.net/ and are invaluable for finding typos, scannos, etc, etc, but they work only on text files. Don't forget to do a spellcheck with whatever word processor you're using. If the book has footnotes, type them at the bottom of their respective page and leave them there until the page is thoroughly proofed. When the chapter is complete, they can be handled as per PG guidelines. (I renumber them sequentially and move them to the end of their chapter.) I speak from a certain amount of experience--several years ago I proofed a 450-page book that someone had spent three years typing by hand. (They had sent me the book to proof against.) The person had done a reasonable job, all things considered, but I still found a number of problems per page. 
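As an editorial aside, the line-count check Al describes (type the lines at book length, then compare the number of lines per typed page against the printed book) is easy to script. This is an illustrative sketch only, not a PG tool; it assumes pagebreaks are marked with a line of dashes, as suggested later in this thread:

```python
# Sketch of Al's line-count check (not an official PG tool): compare the
# number of lines typed for each page against the counts taken from the
# printed book. Pagebreaks are assumed to be marked with a line of dashes.

def count_page_lines(text):
    """Split a type-in on dash-line pagebreaks; return non-blank lines per page."""
    pages = [[]]
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and set(stripped) == {"-"} and len(stripped) >= 4:
            pages.append([])   # a line of dashes starts a new page
        else:
            pages[-1].append(line)
    # Blank lines are ignored so paragraph spacing doesn't skew the count.
    return [sum(1 for l in page if l.strip()) for page in pages]

def check_counts(text, expected):
    """Return (page number, typed count, book count) for mismatched pages."""
    return [(i + 1, a, e)
            for i, (a, e) in enumerate(zip(count_page_lines(text), expected))
            if a != e]
```

A mismatch pinpoints the page where a line was skipped or doubled, which is exactly the "you've done something wrong" signal Al is after.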
The whole exercise took me about six weeks, and I'm certain that there are still problems in the posted version, just because of the density and complexity of the book. Several months later, I got hold of the book's follow-on volume, which had about the same complexity, and produced it in about two weeks, clearance to submission, with a much cleaner result, simply because I started from a scanned text, not a hand-typed text.

A follow-up on Robert's comment re getting a clearance: Michael Hart's address is in this How-To: http://www.gutenberg.org/wiki/Gutenberg:Copyright_How-To.

Al

From steven at desjardins.org Sat Mar 1 19:47:50 2008
From: steven at desjardins.org (Steven desJardins)
Date: Sat, 1 Mar 2008 22:47:50 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com>
References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com>
	<1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com>
	<8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com>
Message-ID: <41fd8970803011947p7a63350dh8673b7bddec00048@mail.gmail.com>

On Sat, Mar 1, 2008 at 2:08 PM, Kionon wrote:
> > Instead of trying to do it on your own, come join the community at
> > Distributed Proofreaders. We make better books and we (usually) have a
> > lot of fun doing it. If you participate in the community forum, the
> > bulletin board, you will meet lots of bright, bookish, and
> > delightfully eccentric people.
>
> I had intended to do that, certainly, but again I point to the fact I
> wish to work on projects that have had a particular impact on me; we
> are far more wont, it should not surprise you, to preserve that which
> we are fond of.

It's possible to make such a contribution at Distributed Proofreaders, even without a scanner or OCR software.
There are several sites, like Google Book Search and the Internet Archive, which have scans of public domain books. If you find one of the books you want to preserve on one of these sites (and you check to make sure it has no missing pages), then you should be able to find someone who will OCR the files for you. At that point, you can take over as Project Manager and shepherd the book through the rounds. When it enters post-processing you can, if you choose, do that step yourself, using software developed at Distributed Proofreaders for exactly that purpose. I guarantee you this will be easier and result in a higher-quality electronic book than trying to type in the whole thing and proofread it yourself. In any case, before trying to do a solo project, I strongly recommend you spend some time at Distributed Proofreaders, get some experience in the proofreading and formatting rounds, and post-process one or two books from DP's pool. Over time, DP has figured out what works pretty well and what doesn't. You may not agree with all of our procedures--if you read this list for more than a few days, you'll see that a lot of people don't--but working through your first few books with a set of carefully established guidelines and a forum full of helpful, experienced folks is the best education I know of. From klofstrom at gmail.com Sat Mar 1 20:21:09 2008 From: klofstrom at gmail.com (Karen Lofstrom) Date: Sat, 1 Mar 2008 18:21:09 -1000 Subject: [gutvol-d] The Old Fashioned Way... 
In-Reply-To: <41fd8970803011947p7a63350dh8673b7bddec00048@mail.gmail.com> References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com> <41fd8970803011947p7a63350dh8673b7bddec00048@mail.gmail.com> Message-ID: <1e8e65080803012021l6c8a47cfp5d4dd196fd3db91e@mail.gmail.com> On 3/1/08, Steven desJardins wrote: > It's possible to make such a contribution at Distributed Proofreaders, > even without a scanner or OCR software. There are several sites, like > Google Book Search and the Internet Archive, which have scans of > public domain books. If you find one of the books you want to preserve > on one of these sites (and you check to make sure it has no missing > pages), then you should be able to find someone who will OCR the files > for you. At that point, you can take over as Project Manager and > shepherd the book through the rounds. When it enters post-processing > you can, if you choose, do that step yourself, using software > developed at Distributed Proofreaders for exactly that purpose. I > guarantee you this will be easier and result in a higher-quality > electronic book than trying to type in the whole thing and proofread > it yourself. Steven knows whereof he speaks; he's one of the more prolific of the content providers at DP. He likes to feel responsible for the finished result, so he usually does just as he says: PMs and then PPs the book. I've done this for a couple of books, but I'm much lazier than Steven and like to just proof, leaving the responsibility to others. I can assure you that if you PM and PP, you will feel that it's YOUR book. It will have your name on it too, in the acknowledgments. -- Zora From Bowerbird at aol.com Sat Mar 1 21:42:19 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 2 Mar 2008 00:42:19 EST Subject: [gutvol-d] The Old Fashioned Way... 
Message-ID: kevin said: > If I wish to add a public domain book to the project, > and I actually desire to type it up by hand, > is there any reason why I can't? no, there's no reason you can't. and indeed, it is a _wonderful_ way to interact deeply with a book. when a book is truly meaningful to you, you will absorb it into your d.n.a. if you type it by hand... it's very time-consuming, yes, but also rewarding. *** you might look on the net to see if the book has already been digitized... if it has, then you would be doing the world an equally worthy favor if you _proofed_ that existing copy. just an idea for you. *** also, if you need help in finalizing your type-ins, i'd be honored to give you some of my software... i'll even go further and check your work myself... i've submitted to p.g., so i know the requirements. *** steven said: > I guarantee you this will be easier and result in > a higher-quality electronic book than trying to > type in the whole thing and proofread it yourself. and i guarantee you that, with my help, kevin, you will be able to create a higher-quality electronic-book than d.p. *** steven said: > In any case, before trying to do a solo project, > I strongly recommend you spend some time > at Distributed Proofreaders, get some experience > in the proofreading and formatting rounds, and > post-process one or two books from DP's pool. how very ironic that this post should come through today. because just _yesterday_, i got an e-mail from a person... a while back, he had come on the list, just like you, kevin, asking how he might begin the process of digitization... i recommended that he should go over to d.p. and join, proof some pages there, see how the system worked, etc. i told him not to pick up any bad habits over there, because they do a lot of things the wrong way, but to join and learn... well, i guess it didn't work out for him. he's back on his own. so i told him i'd help him out. and he sent me his text-files... 
and gosh, i'm looking at the mess that d.p. visits on a book... i've also been examining one of the "tests" they are running, and i've found it is 10 times more work _undoing_ their mess than it would've been if i'd started with the original materials. so i can no longer in good faith point anyone toward d.p. indeed, i think the best course is to recommend against them.

***

steven said:
> Over time, DP has figured out what works pretty well
> and what doesn't.

i disagree. vehemently. the d.p. workflow is _extremely_ bad. it wastes valuable time and energy of thousands of volunteers. the reason people think it's good is they don't know any better.

> working through your first few books with a set of
> carefully established guidelines and a forum full of
> helpful, experienced folks is the best education I know of.

if you could be immunized against the damage caused by exposure to the d.p. workflow, then you might well benefit from dialog with the volunteers in the forums, should you come across any rough spots when doing your digitization. there are people there with a lot of digitization experience... but as that's probably not possible, i'd advise you to stay away. again, this is a change from what i've recommended up to now. it was wrong to recommend them, so i have changed my mind...

***

and now we come to the advice given to kevin by al haines. al is a _phenomenal_ digitizer -- he has done probably _hundreds_ of books submitted to d.p. -- and is a newly-deputized whitewasher. so it pains me a great deal to have to take issue with some of his points, however minor my disagreement might be. (and it's usually quite minor.) but i must. so i will.

***

al said:
> I don't know what word processor you're using, but try to
> keep the lines the same length as they are in the original book.
> If possible, put a soft return at the end of each line (in MS-Word,
> that's done with Shift-Enter). (But use a hard return at paragraph end.)
> That way, you can do a line count of each typed page,
> and if that doesn't match the book, you've done something wrong.
> The soft returns can be dealt with when the chapter is complete.

ok, so let us begin with the slight disagreement. :+)

don't "try" to keep the lines the same length as in the original book. instead, type the lines exactly _as_is_. even hyphenate the words which were hyphenated when the text hit against the right margin. it's true p.g. wants you to dehyphenate those words, but you can have me (have my software) do that dehyphenation for you, later. some people _want_ the linebreaks just as they were in the p-book, and there's absolutely no reason for you not to make them happy... besides, you will find it easier to get into the rhythm of the type-in if you get yourself in sync with the lines as they appear on the page. and, of course, _proofing_ -- which we all agree has to be done -- is _absolutely_ easier (by orders of magnitude) when the linebreaks in your text-file precisely match the linebreaks of the physical page. so put a hard-return at the end of each line, and 2 hard-returns at the end of each paragraph, and save yourself a ton of later misery...

> Save each chapter as a separate file, in two formats --
> your word processor's native format and as a standard text file.
> Run Gutcheck, Jeebies, and Gutspell on the text version,
> and fix problems in the native version.

you can break the file into separate chapters if you _want_ to. but there's certainly no _need_ to do it. (i'd think it's a hassle.) and for goodness sake, do _not_ use "gutcheck", or "jeebies", or "gutspell", even if you know what they are. (and if you don't, be glad that you don't have to bother learning what they are...) i will correct your text entirely by running checks using my tools.
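As an editorial aside, the "dehyphenate later" step described above can be sketched in a few lines. This is my own illustration, not bowerbird's software; it naively rejoins any word split at a line's right margin, whereas a real tool would also consult a wordlist before removing a possibly genuine hyphen (e.g. "to-day"):

```python
# Naive sketch of dehyphenating a line-for-line type-in. A word split at
# the right margin ends its line with a single hyphen; this pulls the
# second half of the word back up, leaving the p-book linebreaks intact.
# (If the fragment was the whole next line, an empty line is left behind.)

def dehyphenate(lines):
    lines = list(lines)  # work on a copy
    out = []
    for i, line in enumerate(lines):
        if (line.endswith("-") and not line.endswith("--")  # skip "--" dashes
                and i + 1 < len(lines) and lines[i + 1].strip()):
            first, _, rest = lines[i + 1].partition(" ")
            line = line[:-1] + first   # rejoin the broken word
            lines[i + 1] = rest        # remainder stays on its own line
        out.append(line)
    return out
```

Because the linebreaks are preserved, the output still matches the printed page line for line, which keeps the later proofing pass easy.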
you _do_ need to run a spellcheck on your work, most certainly, but anyone who has learned to proof text by reading it backward doesn't need to be told something as basic and elementary as that.

what you might not appreciate, however, at least not _sufficiently_, is the value of creating a specific "dictionary" for your spellchecker for each book. a book has a certain number of words unique to it -- such as names -- which will typically occur with great frequency. you don't want your spellchecker to stop at each and every one, but you might not wanna add the word to your _main_ dictionary either. so if your wordprocessor lets you do so -- and many of them do -- declare a "special" dictionary for each book you do, and use it then. (indeed, if your _main_ spellchecker doesn't allow this, it is worth the trouble to do the spellchecking in a wordprocessor that does.) don't use the "alternate" dictionary, allowed by some wordprocessors. make it a "special" dictionary, one that you'll use solely for that book. that way you can always go back to the book, at any time, and call up its special dictionary. very handy. as an aside, i find it fascinating to examine the special dictionary; gives you insight into the book itself.

> If the book has footnotes, type them at the bottom of their
> respective page and leave them there until the page is
> thoroughly proofed. When the chapter is complete,
> they can be handled as per PG guidelines. (I renumber them
> sequentially and move them to the end of their chapter.)

my programs automatically handle all of that footnote movement... just type 'em at the bottom of the page, exactly like you find them, and let me worry about the rest of it.

oh yeah, _about_ pages... just as you've recorded the linebreaks as they were in the p-book, you'll need to mark the pagebreaks as well.
i suggest you simply use the "pagebreak" command in your wordprocessor, but if you are using one that doesn't have such a command, then just type a line of dashes for a pagebreak. _do_ type the pagenumber too; you can either type it at the _bottom_ of each page, or the _top_, but do it _consistently_ -- even if the p-book did it inconsistently! also, because you want these p-book pagenumbers to be in sync with the pagenumbers as they're figured by your wordprocessor, put the frontmatter in one file, and the body-text in another file... that way, "page 1" in your wordprocessor will be the _real_ "page 1". let's see, is there anything else? text-styling! oh my goodness, i almost forgot. p.g. wants ascii, but you should _definitely_ record any text-styling when present, like italics and bold. use your regular wordprocessor formatting. when you are finished, you can convert it to the p.g. conventions. (what this means is that you do _not_ save your file as a text file.) i can create an .html file and a .pdf out of your work, if you want, so don't think you must do any special work to accomplish that... feel free to send me any of your work once you've started doing it. i'll be happy to give you feedback if you're doing anything wrong... and if google (or anyone else) has scanned the book you're doing, let me know, and i'll take a look at it to see if you need any advice... (i will also o.c.r. it for you, and compare the output to your type-in; that way, you won't even have to do the proofing if you don't want. in contrast to typing, proofing does _not_ imprint upon your d.n.a.) welcome to the world of book-digitizing... -bowerbird p.s. having your computer do text-to-speech on your file can be a _great_ way to do "proofing". the errors really jump out at you... ************** Ideas to please picky eaters. Watch video on AOL Living. 
(http://living.aol.com/video/how-to-please-your-picky-eater/rachel-campos-duffy/ 2050827?NCID=aolcmp00300000002598) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080302/619674e1/attachment-0001.htm From ajhaines at shaw.ca Sat Mar 1 23:17:56 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sat, 01 Mar 2008 23:17:56 -0800 Subject: [gutvol-d] The Old Fashioned Way... References: Message-ID: <001701c87c35$89af7f30$6501a8c0@ahainesp2400> Correction - I've never used DP for my submissions. Two reasons--one, I wasn't aware of DP until a year or so after I started doing ebooks for PG, and, two, I prefer the immediacy, control, and accuracy I can bring to an ebook by doing everything myself. (I did use DP's harvesting page some months ago to record a couple of harvests from Internet Archive, but did my own work on them, then stopped harvesting in favor of the several hundred books I have that aren't in PG *or* in IA.) Bowerbird - can you supply a list of the books you've submitted to PG? I'd like to have a look a them. Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Saturday, March 01, 2008 9:42 PM Subject: Re: [gutvol-d] The Old Fashioned Way... kevin said: > If I wish to add a public domain book to the project, > and I actually desire to type it up by hand, > is there any reason why I can't? no, there's no reason you can't. and indeed, it is a _wonderful_ way to interact deeply with a book. when a book is truly meaningful to you, you will absorb it into your d.n.a. if you type it by hand... it's very time-consuming, yes, but also rewarding. *** you might look on the net to see if the book has already been digitized... if it has, then you would be doing the world an equally worthy favor if you _proofed_ that existing copy. just an idea for you. 
*** also, if you need help in finalizing your type-ins, i'd be honored to give you some of my software... i'll even go further and check your work myself... i've submitted to p.g., so i know the requirements. *** steven said: > I guarantee you this will be easier and result in > a higher-quality electronic book than trying to > type in the whole thing and proofread it yourself. and i guarantee you that, with my help, kevin, you will be able to create a higher-quality electronic-book than d.p. *** steven said: > In any case, before trying to do a solo project, > I strongly recommend you spend some time > at Distributed Proofreaders, get some experience > in the proofreading and formatting rounds, and > post-process one or two books from DP's pool. how very ironic that this post should come through today. because just _yesterday_, i got an e-mail from a person... a while back, he had come on the list, just like you, kevin, asking how he might begin the process of digitization... i recommended that he should go over to d.p. and join, proof some pages there, see how the system worked, etc. i told him not to pick up any bad habits over there, because they do a lot of things the wrong way, but to join and learn... well, i guess it didn't work out for him. he's back on his own. so i told him i'd help him out. and he sent me his text-files... and gosh, i'm looking at the mess that d.p. visits on a book... i've also been examining one of the "tests" they are running, and i've found it is 10 times more work _undoing_ their mess than it would've been if i'd started with the original materials. so i can no longer in good faith point anyone toward d.p. indeed, i think the best course is to recommend against them. *** steven said: > Over time, DP has figured out what works pretty well > and what doesn't. i disagree. vehemently. the d.p. workflow is _extremely_ bad. it wastes valuable time and energy of thousands of volunteers. 
the reason people think it's good is they don't know any better. > working through your first few books with a set of > carefully established guidelines and a forum full of > helpful, experienced folks is the best education I know of. if you could be immunized against the damage caused by exposure to the d.p. workflow, then you might well benefit from dialog with the volunteers in the forums, should you come across any rough spots when doing your digitization. there are people there with a lot of digitization experience... but as that's probably not possible, i'd advise you to stay away. again, this is a change from what i've recommended up to now. it was wrong to recommend them, so i have changed my mind... *** and now we come to the advice given to kevin by al haines. al is a _phenomenal_ digitizer -- he has done probably _hundreds_ of books submitted to d.p. -- and is a newly-deputized whitewasher. so it pains me a great deal to have to take issue with some of his points, however minor my disagreement might be. (and it's usually quite minor.) but i must. so i will. *** al said: > I don't know what word processor you're using, but try to > keep the lines the same length as they are in the original book. > If possible, put a soft return at the end of each line (in MS-Word, > that's done with Shift-Enter). (But use a hard return at paragraph end.) > That way, you can do a line count of each typed page, > and if that doesn't match the book, you've done something wrong. > The soft returns can be dealt with when the chapter is complete. ok, so let us begin with the slight disagreement. :+) don't "try" to keep the lines the same length as in the original book. instead, type the lines exactly _as_is_. even hyphenate the words which were hyphenated when the text hit against the right margin. it's true p.g. wants you to dehyphenate those words, but you can have me (have my software) do that dehyphenation for you, later. 
some people _want_ the linebreaks just as they were in the p-book, and there's absolutely no reason for you not to make them happy... besides, you will find it easier to get into the rhythm of the type-in if you get yourself in sync with the lines as they appear on the page.

and, of course, _proofing_ -- which we all agree has to be done -- is _absolutely_ easier (by orders of magnitude) when the linebreaks in your text-file precisely match the linebreaks of the physical page. so put a hard-return at the end of each line, and 2 hard-returns at the end of each paragraph, and save yourself a ton of later misery...

> Save each chapter as a separate file, in two formats --
> your word processor's native format and as a standard text file.
> Run Gutcheck, Jeebies, and Gutspell on the text version,
> and fix problems in the native version.

you can break the file into separate chapters if you _want_ to. but there's certainly no _need_ to do it. (i'd think it's a hassle.)

and for goodness sake, do _not_ use "gutcheck", or "jeebies", or "gutspell", even if you know what they are. (and if you don't, be glad that you don't have to bother learning what they are...) i will correct your text entirely by running checks using my tools.

you _do_ need to run a spellcheck on your work, most certainly, but anyone who has learned to proof text by reading it backward doesn't need to be told something as basic and elementary as that.

what you might not appreciate, however, at least not _sufficiently_, is the value of creating a specific "dictionary" for your spellchecker for each book. a book has a certain number of words unique to it -- such as names -- which will typically occur with great frequency. you don't want your spellchecker to stop at each and every one, but you might not wanna add the word to your _main_ dictionary either. so if your wordprocessor lets you do so -- and many of them do -- declare a "special" dictionary for each book you do, and use it then.
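The per-book dictionary idea is easy to prototype outside a word processor, too. A minimal sketch (the function name and word lists are invented for illustration, and a real spellchecker handles case variants, possessives, and hyphenation far more carefully): report only words found in neither the main wordlist nor the book's special dictionary.

```python
import re

def unknown_words(text, main_dict, book_dict):
    """Words in `text` found in neither the main dictionary nor the
    book's special dictionary.  Sketch only: comparison is done
    case-insensitively and punctuation other than apostrophes is
    ignored."""
    known = {w.lower() for w in main_dict} | {w.lower() for w in book_dict}
    words = re.findall(r"[A-Za-z']+", text)
    return sorted({w for w in words if w.lower() not in known})
```

With a special dictionary holding a book's character names, only genuine suspects surface, which is exactly the benefit claimed for per-book dictionaries: the checker stops flagging the names that recur on every page.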
(indeed, if your _main_ spellchecker doesn't allow this, it is worth the trouble to do the spellchecking in a wordprocessor that does.)

don't use the "alternate" dictionary, allowed by some wordprocessors. make it a "special" dictionary, one that you'll use solely for that book. that way you can always go back to the book, at any time, and call up its special dictionary. very handy. as an aside, i find it fascinating to examine the special dictionary; gives you insight into the book itself.

> If the book has footnotes, type them at the bottom of their
> respective page and leave them there until the page is
> thoroughly proofed. When the chapter is complete,
> they can be handled as per PG guidelines. (I renumber them
> sequentially and move them to the end of their chapter.)

my programs automatically handle all of that footnote movement... just type 'em at the bottom of the page, exactly like you find them, and let me worry about the rest of it.

oh yeah, _about_ pages... just as you've recorded the linebreaks as they were in the p-book, you'll need to mark the pagebreaks as well. i suggest you simply use the "pagebreak" command in your wordprocessor, but if you are using one that doesn't have such a command, then just type a line of dashes for a pagebreak. _do_ type the pagenumber too; you can either type it at the _bottom_ of each page, or the _top_, but do it _consistently_ -- even if the p-book did it inconsistently!

also, because you want these p-book pagenumbers to be in sync with the pagenumbers as they're figured by your wordprocessor, put the frontmatter in one file, and the body-text in another file... that way, "page 1" in your wordprocessor will be the _real_ "page 1".

let's see, is there anything else? text-styling! oh my goodness, i almost forgot. p.g. wants ascii, but you should _definitely_ record any text-styling when present, like italics and bold. use your regular wordprocessor formatting. when you are finished, you can convert it to the p.g.
conventions. (what this means is that you do _not_ save your file as a text file.)

i can create an .html file and a .pdf out of your work, if you want, so don't think you must do any special work to accomplish that...

feel free to send me any of your work once you've started doing it. i'll be happy to give you feedback if you're doing anything wrong... and if google (or anyone else) has scanned the book you're doing, let me know, and i'll take a look at it to see if you need any advice... (i will also o.c.r. it for you, and compare the output to your type-in; that way, you won't even have to do the proofing if you don't want. in contrast to typing, proofing does _not_ imprint upon your d.n.a.)

welcome to the world of book-digitizing...

-bowerbird

p.s. having your computer do text-to-speech on your file can be a _great_ way to do "proofing". the errors really jump out at you...

**************
Ideas to please picky eaters. Watch video on AOL Living. (http://living.aol.com/video/how-to-please-your-picky-eater/rachel-campos-duffy/2050827?NCID=aolcmp00300000002598)

------------------------------------------------------------------------------
_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080301/14e97ce8/attachment.htm

From hyphen at hyphenologist.co.uk Sat Mar 1 23:38:59 2008
From: hyphen at hyphenologist.co.uk (Dave Fawthrop)
Date: Sun, 2 Mar 2008 07:38:59 -0000
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com>
References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com> <15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com>
Message-ID: <001801c87c38$7b822a40$72867ec0$@co.uk>

I have a spare scanner in the loft and an unused CDROM of Abbyy Finereader if there is any way of shipping them to you, or anyone else for that matter. I am in the UK.

Dave Fawthrop

From: gutvol-d-bounces at lists.pglaf.org [mailto:gutvol-d-bounces at lists.pglaf.org] On Behalf Of Robert Cicconetti
Sent: 01 March 2008 19:14
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] The Old Fashioned Way...

On Sat, Mar 1, 2008 at 2:08 PM, Kionon wrote:

That, and of course, I lack a scanner, and am not located within a reasonable radius of a location with which to obtain one (I am, in fact, not even in an English speaking country).

You have to have access to a scanner (or find a scan online of the same edition) in order to use the copyright clearance process at copy.pglaf.org. I think there is an older process involving mailing in photocopies of the title/verso to MH, but I have never used it...

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080302/74bc21be/attachment-0001.htm

From prosfilaes at gmail.com Sun Mar 2 07:54:19 2008
From: prosfilaes at gmail.com (David Starner)
Date: Sun, 2 Mar 2008 10:54:19 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To:
References:
Message-ID: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com>

Let me note, Kionon, that DP has produced 12,000 volumes for Project Gutenberg, and that the tools that Bowerbird dismisses, gutprep, gutcheck and jeebies, have been used on most of those.
They're fairly well documented and have the source code available if you're inclined to dig deeper or change things. On the other hand, Bowerbird has never done a book for Project Gutenberg, rarely shares his tools and never his code. Note also how often he says "i will" instead of "here's this tool that will let you" or "here's a webpage to show you how". That would be very concerning to me, if I were to work on a project.

From Bowerbird at aol.com Sun Mar 2 12:24:48 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 2 Mar 2008 15:24:48 EST
Subject: [gutvol-d] The Old Fashioned Way...
Message-ID:

al said:
> Bowerbird - can you supply a list of the books
> you've submitted to PG? I'd like to have a look at them.

just one, al. "the universe (or nothing)" by meyer moldeven. #18257.

it was actually never published, so it was a "type-in" in the purest sense, which meant that i mostly did the _editing_ on it, not "proofing" per se... meyer is an old guy who wanted the future to have access to his story... when he posted of his intentions, i offered to help him do a submission.

i would _love_ to hear your feedback on my treatment of meyer's book. indeed, if anyone can find anything i did wrong on it, i'd appreciate it...

it is a completely normal book -- just chapter after chapter of text -- so it wasn't like the _formatting_ of it was difficult. but since it hadn't been through the hands of a professional copy-editor at a publisher, it had lots of copy-editing glitches, so i had to write tools to detect 'em.

but there was no one checking _my_ work, to see if i had made errors... so if anyone wants to do that, it would be nice. heck, even if you want to do it so you can poke me in the eye with a mistake, feel free to proceed...

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080302/f3f0064a/attachment.htm

From schultzk at uni-trier.de Mon Mar 3 00:10:06 2008
From: schultzk at uni-trier.de (Schultz Keith J.)
Date: Mon, 3 Mar 2008 09:10:06 +0100
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com>
References: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com>
Message-ID: <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>

Hi Everybody,

Just followed this thread and I ask how ignorant can people get??

1) Somebody is willing to support and contribute to the project !!
2) The project mongers say no way José you are tooo backwards !!?
3) The work you are going to do will be bad !!???
4) You need different hardware or we do not want you !!??

If somebody wants to contribute let them. If they want to do them by hand then all the more power to them.

As for typing it in by HAND the old fashioned way. I would trust my girl friend more than any old (or newer) scanner/OCR system. Why, you ask? She is a professional secretary and she will outdo any scanner on the first several pages. She does not need to correct anything. No time wasted. I can even dictate to her and it goes into the computer!!!

That is proof enough that single persons can be proficient enough. Please do not bang on those who are willing to do good OLD FASHIONED HANDY WORK !!

Kionon go for it. Do not be stopped by the ignorant. There IS NOTHING stopping you from doing it and getting it contributed to PG. DP maybe, but then again DP is not PG

regards
Keith.

From Catenacci at Ieee.Org Mon Mar 3 04:07:05 2008
From: Catenacci at Ieee.Org (Onorio Catenacci)
Date: Mon, 3 Mar 2008 07:07:05 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>
References: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com> <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>
Message-ID:

On Mon, Mar 3, 2008 at 3:10 AM, Schultz Keith J. wrote:
> Hi Everybody,
>
> Just followed this thread and I ask how ignorant can people get??
>
> 1) Somebody is willing to support and contribute to the project !!
> 2) The project mongers say no way José you are tooo backwards !!?
> 3) The work you are going to do will be bad !!???
> 4) You need different hardware or we do not want you !!??
>
> If somebody wants to contribute let them. If they want to do them by hand then all the more power to them.
>
> As for typing it in by HAND the old fashioned way. I would trust my girl friend more than any old (or newer) scanner/OCR system. Why, you ask? She is a professional secretary and she will outdo any scanner on the first several pages. She does not need to correct anything. No time wasted. I can even dictate to her and it goes into the computer!!!
>
> That is proof enough that single persons can be proficient enough. Please do not bang on those who are willing to do good OLD FASHIONED HANDY WORK !!
>
> Kionon go for it. Do not be stopped by the ignorant. There IS NOTHING stopping you from doing it and getting it contributed to PG. DP maybe, but then again DP is not PG

Hi Keith,

There's one little issue that would prevent Kionon from contributing--that being how will anyone be able to check his electronic text against the original without scans? Unless someone else owns a copy of the book and they're willing to proof it, that seems like a fairly major problem to me.

--
Onorio Catenacci III

From joshua at hutchinson.net Mon Mar 3 05:26:23 2008
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Mon, 3 Mar 2008 13:26:23 +0000 (GMT)
Subject: [gutvol-d] The Old Fashioned Way...
Message-ID: <683955382.169941204550783529.JavaMail.mail@webmail03>

Don't be ignorant yourself.

No one told him he COULDN'T do it. They merely told him why it is much much more likely to have problems.

They also suggested seeing how DP does things to get a better idea of some "best practices".

He even got suggestions on how to work around the problem that you need to have scans of the title and verso for clearance.

Josh

On Mar 3, 2008, schultzk at uni-trier.de wrote:

Hi Everybody,

Just followed this thread and I ask how ignorant can people get??

1) Somebody is willing to support and contribute to the project !!
2) The project mongers say no way José you are tooo backwards !!?
3) The work you are going to do will be bad !!???
4) You need different hardware or we do not want you !!??

If somebody wants to contribute let them. If they want to do them by hand then all the more power to them.

As for typing it in by HAND the old fashioned way. I would trust my girl friend more than any old (or newer) scanner/OCR system. Why, you ask? She is a professional secretary and she will outdo any scanner on the first several pages. She does not need to correct anything. No time wasted. I can even dictate to her and it goes into the computer!!!

That is proof enough that single persons can be proficient enough. Please do not bang on those who are willing to do good OLD FASHIONED HANDY WORK !!

Kionon go for it. Do not be stopped by the ignorant. There IS NOTHING stopping you from doing it and getting it contributed to PG. DP maybe, but then again DP is not PG

regards
Keith.

From joshua at hutchinson.net Mon Mar 3 05:28:05 2008
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Mon, 3 Mar 2008 13:28:05 +0000 (GMT)
Subject: [gutvol-d] The Old Fashioned Way...
Message-ID: <433504138.170011204550885503.JavaMail.mail@webmail03>

Robert,

That's a fairly common thing. A *huge* majority of our books don't have scans to check against. Most would prefer we had them, but especially for our older stuff, we don't have anything.

Josh

On Mar 3, 2008, Catenacci at Ieee.Org wrote:

Hi Keith,

There's one little issue that would prevent Kionon from contributing--that being how will anyone be able to check his electronic text against the original without scans? Unless someone else owns a copy of the book and they're willing to proof it, that seems like a fairly major problem to me.

--
Onorio Catenacci III

From hyphen at hyphenologist.co.uk Mon Mar 3 07:55:23 2008
From: hyphen at hyphenologist.co.uk (Dave Fawthrop)
Date: Mon, 3 Mar 2008 15:55:23 -0000
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To:
References: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com> <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>
Message-ID: <001601c87d47$01878db0$0496a910$@co.uk>

Back in the *old* days proofing was done by typing in two copies and then "diffing" the two.

Dave F

-----Original Message-----
From: gutvol-d-bounces at lists.pglaf.org [mailto:gutvol-d-bounces at lists.pglaf.org] On Behalf Of Onorio Catenacci
Sent: 03 March 2008 12:07
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] The Old Fashioned Way...

There's one little issue that would prevent Kionon from contributing--that being how will anyone be able to check his electronic text against the original without scans? Unless someone else owns a copy of the book and they're willing to proof it, that seems like a fairly major problem to me.

From schultzk at uni-trier.de Mon Mar 3 09:26:13 2008
From: schultzk at uni-trier.de (Schultz Keith J.)
Date: Mon, 3 Mar 2008 18:26:13 +0100
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To:
References: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com> <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>
Message-ID: <9D3EFA16-00DC-4AA8-8E7F-6554F847089C@uni-trier.de>

Hi,

I wonder if this is really a problem. Did it not use to be so? Like I said it is up to Kionon. To me it is just one more book.

regards
Keith

On 03.03.2008, at 13:07, Onorio Catenacci wrote:

> On Mon, Mar 3, 2008 at 3:10 AM, Schultz Keith J. wrote:
>> Hi Everybody,
[snip, snip]
>> Kionon go for it. Do not be stopped by the ignorant. There IS NOTHING stopping you from doing it and getting it contributed to PG. DP maybe, but then again DP is not PG
>
> Hi Keith,
>
> There's one little issue that would prevent Kionon from contributing--that being how will anyone be able to check his electronic text against the original without scans? Unless someone else owns a copy of the book and they're willing to proof it, that seems like a fairly major problem to me.
>

From schultzk at uni-trier.de Mon Mar 3 09:45:31 2008
From: schultzk at uni-trier.de (Schultz Keith J.)
Date: Mon, 3 Mar 2008 18:45:31 +0100
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <683955382.169941204550783529.JavaMail.mail@webmail03>
References: <683955382.169941204550783529.JavaMail.mail@webmail03>
Message-ID: <0AE6D9CB-3E6B-4F00-B1B8-A8C95B6726B6@uni-trier.de>

Hi Joshua,

I knew you would BITE! But talking about ignorance:

The Facts:

1) Karen Lofstrom wrote on 1 March:

> On 3/1/08, Kionon wrote:
>
>> If I wish to add a public domain book to the project, and I actually desire to type it up by hand, is there any reason why I can't?
>
> Unfortunately, you can. I say "unfortunately" because it is close to certain that you are going to produce a flawed text. Since Project Gutenberg, at present, doesn't have any quality controls, it will accept your flawed text.
>

De facto she is saying do not do it because ... !!!

2) Robert wrote on 1 March

> On Sat, Mar 1, 2008 at 2:08 PM, Kionon wrote:
>
> That, and of course, I lack a scanner, and am not located within a reasonable radius of a location with which to obtain one (I am, in fact, not even in an English speaking country).
>
> You have to have access to a scanner (or find a scan online of the same edition) in order to use the copyright clearance process at copy.pglaf.org. I think there is an older process involving mailing in photocopies of the title/verso to MH, but I have never used it...
>

De facto if you do NOT have a scan ... No you can not do it. Yes, he was told how he might find a scan. Yet he MUST HAVE a scanned version.

Well Joshua, Kevin asked if there is any reason he can offer his work to PG. He did not ask if it is what DP wants. I personally do not like DP, even though it does good work and is the largest contributor to PG. DP does not have a monopoly. Though from your reaction one gets such a feeling.

regards
Keith.

On 03.03.2008, at 14:26, Joshua Hutchinson wrote:

> Don't be ignorant yourself.
>
> No one told him he COULDN'T do it. They merely told him why it is much much more likely to have problems.
>
> They also suggested seeing how DP does things to get a better idea of some "best practices".
>
> He even got suggestions on how to work around the problem that you need to have scans of the title and verso for clearance.
>
> Josh
>
> [snip, snip]

From ebooks at ibiblio.org Mon Mar 3 10:59:46 2008
From: ebooks at ibiblio.org (Jose Menendez)
Date: Mon, 03 Mar 2008 13:59:46 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To:
References:
Message-ID: <47CC4AA2.5050706@ibiblio.org>

On March 2, 2008, Bowerbird wrote:

> al said:
> > Bowerbird - can you supply a list of the books
> > you've submitted to PG? I'd like to have a look at them.
>
> just one, al. "the universe (or nothing)" by meyer moldeven. #18257.
I take it you mean this ebook, "The Universe -- or Nothing."

http://www.gutenberg.org/files/18257/18257.txt

> it was actually never published, so it was a "type-in" in the purest sense,
> which meant that i mostly did the _editing_ on it, not "proofing" per se...
> meyer is an old guy who wanted the future to have access to his story...
> when he posted of his intentions, i offered to help him do a submission.

I vaguely remembered when Mr. Moldeven posted about it on the Book People mailing list, and a quick search turned up his post in the BP archive.

"Seeking Internet Archive" (29 Jun 2005)
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&post=2005-06-29,6

You replied to him the next day, saying "meyer, i converted your science-fiction piece into z.m.l. format a while back, so it should be acceptable in that form to project gutenberg..."

"re: Seeking Internet Archive" (30 Jun 2005)
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&post=2005-06-30,2

Note the date of your BP post, June 30, 2005. Now, if we look at the PG ebook, we see "Release Date: April 25, 2006." That's nearly *ten* full months after your BP post. It's a good thing you don't use those inefficient DP workflows you're always criticizing. ;)

> indeed, if anyone can find anything i did wrong on it, i'd appreciate it...
>
> it is a completely normal book -- just chapter after chapter of text --
> so it wasn't like the _formatting_ of it was difficult. but since it hadn't
> been through the hands of a professional copy-editor at a publisher,
> it had lots of copy-editing glitches, so i had to write tools to detect 'em.
>
> but there was no one checking _my_ work, to see if i had made errors...

Well, back in late January of 2006, you asked me to check it, but I turned you down.

> so if anyone wants to do that, it would be nice. heck, even if you want to
> do it so you can poke me in the eye with a mistake, feel free to proceed...
A quick check of a simple word frequency list was enough to find a mistake. For instance, the ebook contains two occurrences of "accello-net" (whatever that may be) and one "accello-nets," but there's also one "accelo-nets." There's an "l" missing in that one.

The word frequency list also revealed a number of hyphenation inconsistencies, for example, "interregional" vs. "inter-regional," "mine-layer" vs. "minelayers," "multicolored" vs. "multi-colored," etc.

Now some may say that I'm just nitpicking about those inconsistencies, but I have a good reason. Back in mid-January of 2006, I made an ebook of Willa Cather's "My Ántonia." Since Bowerbird had often taunted Jon Noring publicly about how long it was taking him to finish his version of the same book, I emailed Bowerbird, Jon, and David Rothman about mine. In the ensuing discussion, Bowerbird criticized my version because I had retained similar hyphenation inconsistencies that were in the original paper book, e.g. "grain-sack" vs. "grainsack" and "oil-cloth" vs. "oilcloth." Bowerbird told me forcefully and at great length that I should fix those inconsistencies. He even said that he'd fixed them in his own version of "My Ántonia." So I was surprised to see so many similar inconsistencies in this ebook that he submitted to PG, especially since he submitted it *after* that lengthy discussion about the hyphenated words in Cather's book.

Jose Menendez

P.S. A few checks also revealed a number of punctuation errors. For example:

"What now?", Zolan asked.
That should be
"What now?" Zolan asked.

"Not much choice." Brad replied in a whisper.
That should be
"Not much choice," Brad replied in a whisper.

"Don't count on it." Ram replied grimly.
That should be
"Don't count on it," Ram replied grimly.

From Catenacci at Ieee.Org Mon Mar 3 11:32:31 2008
From: Catenacci at Ieee.Org (Onorio Catenacci)
Date: Mon, 3 Mar 2008 14:32:31 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <433504138.170011204550885503.JavaMail.mail@webmail03>
References: <433504138.170011204550885503.JavaMail.mail@webmail03>
Message-ID:

On Mon, Mar 3, 2008 at 8:28 AM, Joshua Hutchinson wrote:
> Robert,
>
> That's a fairly common thing. A *huge* majority of our books don't have scans to check against. Most would prefer we had them, but especially for our older stuff, we don't have anything.
>

Ah. My bad for making an unjustified assumption. :-)

--
Onorio Catenacci III

From Bowerbird at aol.com Mon Mar 3 12:24:19 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 3 Mar 2008 15:24:19 EST
Subject: [gutvol-d] The Old Fashioned Way...
Message-ID:

oh gee, lookee here, _jose_menendez_ has made an appearance! great to see you jose! even though i know you're here to razz me.

***

jose said:
> Note the date of your BP post, June 30, 2005.
> Now, if we look at the PG ebook, we see "Release Date: April 25, 2006."
> That's nearly *ten* full months after your BP post.
> It's a good thing you don't use those inefficient DP workflows
> you're always criticizing. ;)

notice the smiley there, folks. that means jose is "just kidding". but he has a good point. just exactly why _did_ it take so long?

well, the answer to that is pretty simple. meyer was still making _changes_ to his book. like many authors, he kept rewriting it... however, _unlike_ most editors, i didn't impose a deadline on him. so that accounted for a good chunk of that time.

at some point, though, he did tire of the rewriting, and "finished". of course, by that time, i had other things on my plate, so it took me a little while to get back to it. and then we did copy-editing. and then we did more copy-editing. and then we did even more. if you've ever copy-edited a "raw" book, you know it takes time...

and then, when _that_ was done, i fully intended to demonstrate the .pdf and .html conversion possibilities, but still had to program them.
i'm not one of those disciplined programmers, who can make myself code "on-demand". i have to wait for "the inspiration". and it wasn't all that forthcoming. so finally meyer wrote me, after a heart-attack, saying "i don't know if i'm long for this world; can we post my book?", so i did. thank goodness, as far as i know, he's still alive and kicking...

and _that's_ why it took 10 months. actually, i would have guessed that he waited at least that long just for me to do the programming, so if that was the _total_ time, then i'm a little bit surprised...

> Well, back in late January of 2006, you asked me to check it,

because you're one of the best at finding errors, jose, and i know it. so i figured that if you couldn't find an error, then _nobody_ could...

> but I turned you down.

yeah. you never were one to do me a favor, were you? ;+)

> A quick check of a simple word frequency list
> was enough to find a mistake. For instance,

notice that "for instance", folks. translated into jose, that means "here's one of the errors i found." the _unspoken_ part of that is that he has found _more_, he just ain't gonna tell you, not yet...

> the ebook contains two occurrences of "accello-net"
> (whatever that may be) and one "accello-nets," but there's
> also one "accelo-nets." There's an "l" missing in that one.

i don't even know what an "accello-net" is. _or_ an "accelo-net". or the difference between them. or if there is a difference. :+)

> The word frequency list also revealed a number of hyphenation
> inconsistencies, for example, "interregional" vs. "inter-regional,"
> "mine-layer" vs. "minelayers," "multicolored" vs. "multi-colored," etc.

another good catch! as i said, meyer did a lot of rewriting on this. so it's obvious that i needed to do the hyphenation checks again... but it's not like i didn't do them a half-dozen times before that... or maybe my hyphenation checks just weren't too good back then. who knows, maybe they're not even too good _now_. or maybe so.
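A hyphenation check of the kind being discussed is simple to sketch: build a word-frequency list and flag every word that occurs both hyphenated and closed up. This is an illustrative sketch (the function name is invented), not bowerbird's tool or DP's, and it deliberately misses singular/plural mismatches like "mine-layer" vs. "minelayers", which would need stemming on top.

```python
import re
from collections import Counter

def hyphenation_inconsistencies(text):
    """Pairs like ('multi-colored', 'multicolored') where both forms
    occur in the text.  Case-folded; purely a frequency-list check, so
    inflected mismatches are not caught."""
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    freq = Counter(w.lower() for w in tokens)
    return sorted((w, w.replace('-', '')) for w in freq
                  if '-' in w and w.replace('-', '') in freq)
```

Run over a whole text, this surfaces exactly the "interregional" vs. "inter-regional" class of inconsistency from a single pass, which is why a word-frequency list is such a cheap first-line check.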
> Now some may say that I'm just nitpicking about those inconsistencies

people who said that would be _wrong_, as far as i'm concerned... and _i'm_ certainly not going to be one of those people who says it. i _want_ to know about any inconsistencies, and eliminate them. so i don't consider it to be "nitpicking", jose, not in the slightest... they might not be a _serious_ error -- ok, they're _not_ a serious error -- but they are an error nonetheless, and i want _all_ errors to be corrected. so i am _deeply_ appreciative to you for bringing them to my attention... now, how about the _other_ errors you found... :+)

> Since Bowerbird had often taunted Jon Noring publicly
> about how long it was taking him to finish his version
> of the same book, I emailed Bowerbird, Jon, and David Rothman
> about mine. In the ensuing discussion, Bowerbird criticized
> my version because I had retained similar hyphenation inconsistencies
> that were in the original paper book, e.g. "grain-sack" vs. "grainsack"
> and "oil-cloth" vs. "oilcloth."

i'm quite sure i didn't _criticize_ you for retaining them, jose. the decision about whether to _retain_ such inconsistencies is one that can go either way... since you consider yourself to be _replicating_ the p-book, your decision would be to retain them. since i consider myself to be _republishing_ the p-book, i fix 'em. different strokes for different folks; that's what makes a horse race.

> Bowerbird told me forcefully

you might have interpreted my posts as being "forceful" -- probably because of the strength of the logic -- but that's your interpretation.

> and at great length

well, i do go on for a while. but you're just jealous, jose, because you're a 2-finger typist who can't type nearly as fast as he thinks... :+) (have you tried voice-recognition apps? i hear they're good now.)

> that I should fix those inconsistencies.

well, no.
the nature of the discussion would have revolved around the general issue of whether it is better to _replicate_ or _republish_, not around the consequent issue of whether to keep inconsistencies.

> He even said that he'd fixed them
> in his own version of "My Ántonia."

right. because that's what a republisher _should_ do...

> So I was surprised to see so many similar inconsistencies
> in this ebook that he submitted to PG, especially since he
> submitted it *after* that lengthy discussion about
> the hyphenated words in Cather's book.

there's absolutely no question those are errors that need to be fixed. thank you for showing them to me. like i said, you da best... :+)

> P.S. A few checks also revealed a number of punctuation errors.
> For example:
>
> "What now?", Zolan asked.
> That should be
> "What now?" Zolan asked.

really? i would say the comma is correct. if it's not, i'll need to make a check for it...

> "Not much choice." Brad replied in a whisper.
> That should be
> "Not much choice," Brad replied in a whisper.
>
> "Don't count on it." Ram replied grimly.
> That should be
> "Don't count on it," Ram replied grimly.

i'll have to check the context, but i would agree those look wrong. thing is, i don't see how i can automate a check for that. do you? i didn't proofread this book. meyer said other people had done that. i only subjected it to my automated tests. now, i suppose that i could locate every occurrence of "period-quotemark-space-name-replied". but that would fail on occurrences of "responded" instead of "replied", or "said" or "snorted" or "taunted" or any of a number of similar terms. i'd also guess that test will turn up too large a number of false alarms. so it doesn't seem to me to be a _practical_ test to include in my tool. however, if someone suggests a better way for me to phrase the test -- and feel free to use regex if it makes it possible for you to do it -- by all means, please show off your cleverness and share it with me...
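One way to take up that invitation without enumerating verbs: flag a period immediately before a closing straight quote when it is followed by a capitalized word and then a lowercase word. That shape catches "replied", "said", "snorted" and the rest alike, while leaving '?" Zolan asked' alone. It is only a suspect-finder, not a fix: legitimate narrative such as '"Stop." Brad turned away.' will trigger it too, so a human still has to judge each hit. A sketch, not gutcheck's actual behavior.

```python
import re

# Suspect pattern: period + closing quote, then a Capitalized word and a
# lowercase word, as in '"Not much choice." Brad replied'.  Question marks
# and exclamation points before the quote are deliberately not flagged.
SUSPECT_TAG = re.compile(r'\."\s+[A-Z][a-z]+\s+[a-z]+')

def suspect_dialog_tags(text):
    """Return the suspect spans for a human to review."""
    return SUSPECT_TAG.findall(text)
```

The corrected comma form ('"Not much choice," Brad replied') passes clean, so running the check before and after a round of fixes shows exactly which suspects remain.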
i'm curious, does gutcheck find this type of error?

-bowerbird

p.s. also, just for the record, and so everyone is absolutely clear here, i didn't _force_ any changes on meyer. didn't even make them for him, for the most part. i'd just send him a list of "stuff that i would change", and he'd either make the changes or not, depending on his own mind... even if something was "wrong" grammatically, if he _wanted_ it that way, that's the way it stayed. when you have a living author, you have _zero_ difficulty determining "the intent of the author", so i gave him free rein. i do not think he'd be stubborn about fixing the errors reported above, so i'm not offering that here as an _excuse_, because it does not apply, but i felt the need to say it to set the record straight... and jose, thanks!

**************
It's Tax Time! Get tips, forms, and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolprf00030000000001)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080303/fa89c54d/attachment.htm

From prosfilaes at gmail.com Mon Mar 3 16:06:36 2008
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 3 Mar 2008 19:06:36 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>
References: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com> <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>
Message-ID: <6d99d1fd0803031606n14fdf485g768e2eaa18f96d7@mail.gmail.com>

On Mon, Mar 3, 2008 at 3:10 AM, Schultz Keith J. wrote:

> Just followed this thread and I ask how ignorant can people
> get??

If you define ignorant as disagreeing with you, very. But I think that's an overly parochial definition.

> If somebody wants to contribute let them. If they want to do them by
> hand
> then all the more power to them.

Each book in Project Gutenberg reflects on the quality of the whole.
Not only that, Novel by Joe Shmoe getting posted will stop most other people from working on it at all, which means that a poor-quality edition will stop a high-quality edition from being posted. From my perspective, that's motivation to encourage people to submit only high-quality copies to PG.

> To that is proof enough that single persons can be proficient
> enough.

"I know a person who is perfect at this" is hardly proof; it's barely even an argument. We've all seen the opposite; DP is proofing the motion picture copyright filings, and is finding that the original typing has left several errors a page. To achieve the results that DP is achieving, most companies have two typists independently type out the text.

> Please do not
> bang on those who are willing to good OLD FASHIONED HANDY WORK !!

Hard work frequently isn't a substitute for using the right tools and right knowledge. The man who picks up a hammer one day and starts building houses for people may be altruistic, but without the right knowledge, he's also endangering lives.

> Well Joshua, Kevin ask if there is any reason he can offer his work
> to PG. He did not ask if it is what DP wants.

DP doesn't want anything; the people who work with DP do. And most of them want PG to be the greatest it can be. Perhaps we assume, naïvely perhaps, that other people would share that goal.

> I personally do not like DP,

Didn't you just say

> Please do not
> bang on those who are willing to good OLD FASHIONED HANDY WORK !!

?

From Bowerbird at aol.com Tue Mar 4 11:41:51 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 4 Mar 2008 14:41:51 EST
Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people
Message-ID:

robert said:

> how will anyone be able to check his electronic text
> against the original without scans?

the same way we check the other books without scans, by finding a copy of the p-book or finding a scan-set...
by the way, any progress on the process of uploading the scans from d.p. to p.g.? c'mon folks, get that done. if you can't do it yourselves, i'll be happy to do it for you, working from the p.g. side, if michael and greg approve.

***

keith said:

> If somebody wants to contribute let them.
> If they want to do them by hand
> then all the more power to them.

i think the general message that he _could_ contribute got through to kevin. (but perhaps kevin could tell us?) yes, the impression that he had to present a scan of the titlepage and verso was misleading, but robert also was fairly quick to correct himself and give kevin an option...

people were mostly concerned about the quality issue. and yeah, that's kind of a red herring, because there are people (like your girlfriend) who can do excellent quality, and we have no idea about the nature of kevin's skills here. (although if he's willing to do it, he's likely not a bad typist.) still, it's probably good to sensitize people to that issue...

however... a recommendation to use the gutcheck tools is not good. first and foremost in this specific case, they are geared to the mistakes typical to the process of o.c.r., _not_ typing... typists don't make he/be errors, or confuse o/c, m/rn, etc. a wordprocessor's spellchecker fixes _typing_ errors fine...

furthermore, and perhaps this issue is really _foremost_, those tools are exceedingly difficult for people to install. some of them also require the installation of _libraries_... continuing in this vein, those tools are _not_ easy to use. as script-based tools that generate lists of potential errors, they are out-of-touch in a world now centered on the g.u.i. now, obviously, if you installed these tools years ago, and you've been using them for years, you will not be sensitive to these concerns. but the _average_ person -- instructed to use them before submitting a book -- might well give up on even _doing_ that book instead...
besides, whitewashers will run those tools on the book that kevin submits anyway. it's not like it goes undone... so, instead, i offered my tools to kevin. they require no installation. they're easy to operate, with a g.u.i. on 'em. moreover, they do the job, as well as those other tools... you can bet that if they _didn't_, people would already have pointed out to me errors in the book i submitted. after all, how long does it take to run checks on a book? they ran them, and found that my submission was clean. and if you don't believe me, i dare you to run 'em yourself. (and, once again, my thanks to jose for reporting errors...)

***

conspicuous by its absence, in contrast, was any offer from the d.p. people to say "we will check your book for you..." after all, they've already installed those checker-programs, and become experienced with using them. they didn't step up and offer to help this new volunteer do what _he_ wants to do, which is to have fun typing in a book manually. instead, they advised him to do what _they_ want him to do, which is to become a part of their group. if he did join d.p., they'd stick him in the p1 round, proofing, and then maybe, after enough time on-site and enough p1 pages, he could _apply_ to take a _test_ which -- if he passed it -- would let him "graduate" up to p2 proofing. whoop-dee-doo...

this insensitivity, to the actual question which was being asked, is probably what made you angry, keith, in my humble opinion. the person who re-contacted me on friday, who i had advised to check out d.p., quit them because they told him he could _not_ process his own book, since he didn't have enough experience, at least if i have understood him properly. it's kinda sad, isn't it?

***

i took issue with the suggestions to join d.p. because learning their workflow does more harm than good... that person who had followed my (former) advice to "join d.p. and observe how they do things over there" cut off runheads-and-pagenumbers when he did o.c.r., as per d.p. policy, which i've informed him is a bad idea. pagenumbers help you know where you're at in a book, and the runheads can be deleted later, automatically... he'd saved the file as a text-file, as per d.p. policy, so i had to tell him to do the o.c.r. again and save as .rtf. his book is full of styling. why throw all of that away? he rejoined end-of-line hyphenates, as per d.p. policy, so i told him to switch that up when he re-did the o.c.r.

d.p. policy just plain _sucks_, on the full range of issues. i've already described their policy on ellipses, which is as wrong as a policy could be on that particular subject. they have proofers _changing_ what was in the p-book, just to implement their judgment-call-required policy. they "clothe" end-line em-dashes, meaning they bring up a word from the following line, which is totally ridiculous, since it creates a super-long "word" (i.e., it joins 2 words), which _exacerbates_ line-wrapping problems. so stupid. i have recently discussed here their _filenaming_ silliness, and will be continuing with that discussion momentarily... d.p. pseudo-markup is a put-it-in-take-it-out exercise. and i won't even begin to discuss the t.e.i. foibles again...

at one choicepoint after another, d.p. takes the wrong fork. wrong. wrong. wrong. consistently... sometimes it seems as if they are actually _trying_ to handicap their volunteers... (i don't think so. but i _have_ heard the view espoused that it would be a good thing to try and slow down the p1 round, so as to "lessen the backlog it creates for the rest of the site", which is patently ridiculous. but still, i think it's incompetence accounting for awful decisions there, the same incompetence that makes it impossible for the leaders to craft a conspiracy.)

further, in addition to the _policy_ problems, another level of incompetence rears its ugly head at _implementation_time_...
for years now, d.p. tolerated some truly _awful_ page-images (badly-done, crooked, inconsistently placed on a canvas, etc.). although this appears to have improved recently -- because they are using scan-sets from the big scanning projects?, or maybe because i ragged on them so much about this -- there are still plenty of bad scans remaining in the system... another thing i have harped on ever since i encountered d.p. was their _negligence_ in performing any post-o.c.r. clean-up. o.c.r. errors often happen in repeated fashion through a book. it's not unusual at all to find _the_very_same_scanno_ occurring time after time after time. these problems can be quickly and easily corrected with one global change across the document. as just one example, old books often had "spacey" punctuation; that is, there was a space inserted before a comma or a period. it makes no sense to have humans delete these spaces manually, when they can be removed across an entire book automatically... in this regard, too, thanks to my constant haranguing on this, they've gotten better. but their performance still sucks badly. for instance, although they usually tend to remove the space in front of commas and periods, when i examined their "test" of "perpetual p1" over this last week, i discovered that they had _not_ closed up the space in front of the ellipses in that book... so proofers had to do that manually. to my mind, that shows that "the powers that be" (as they call themselves) who _run_ the place don't have the intelligence to insist that people who _prepare_ the texts show some consideration and respect for the time and energy of the people who are doing the proofing. the response -- when i've made this point in the past -- is that "the content providers are volunteers, and we can't force them to do what they don't want to do". pardon me? that's garbage. why do you allow _some_ volunteers to place an unfair burden on the shoulders of _other_ volunteers? 
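the kind of one-pass global cleanup being argued for here is genuinely a few lines of code. a sketch only -- whether to close the space before ellipses is a house-style call, and a real tool would want to review each change:

```python
import re

def despace_punctuation(text):
    """close up the space old typesetting put before punctuation,
    across an entire book in one pass."""
    text = re.sub(r' +([,;:!?])', r'\1', text)  # "word ," -> "word,"
    text = re.sub(r' +(\.\.\.)', r'\1', text)   # "word ..." -> "word..."
    return text

def fix_scanno(text, wrong, right):
    """one global change for a scanno that repeats through a book."""
    return text.replace(wrong, right)

print(despace_punctuation("Wait , he said ; then ..."))
# -> Wait, he said; then...
```

run once by whoever prepares the text, this removes thousands of manual corrections from the proofers' plates.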
if a job can be done _efficiently_ and _automatically_ and _simply_ by one volunteer, why would you instead insist that the job be done by _many_ volunteers who can only do it _inefficiently_ and _manually_ and _with_relatively_much_more_difficulty_? i'm _positive_ that if we gave the content providers the choice between making one global change or literally _thousands_ of manual changes, they would choose the global change. how does the equation change when _other_ people are doing the work? it doesn't...

and -- believe it or not -- it gets even worse. in that same "perpetual p1" experiment, the pre-processor had accidentally changed all 1,137 of the em-dashes to en-dashes. did they go back and fix that disastrous mistake? they did not. they just sent the badly disfigured text out for the proofers to fix. that's totally inexcusable. and sadly, it is _not_ that uncommon... all kinds of incompetence are routinely dumped on proofers to fix.

d.p. simply does not respect the time and energy of its volunteers. it's a good thing those volunteers don't realize the extent of this, or they would leave in droves... as it is, i think the _intelligent_ proofers are leaving as individuals, quietly, without making any fuss. after all, how long would _you_ continue to close up those ellipses, or correct those em-dashes, on an individual basis, before you became bored outta your skull? or how long before you spoke up to say, "um, there's a better way"?

***

finally, the d.p. forums are a _fascinating_ example of groupthink... time after time, the correct answer to a problem goes unrecognized, and even occurrences of _that_ are rare, since it seldom surfaces at all. there is a _huge_ propensity to make everything far too complicated, and a strange unwillingness to even _experiment_ with simple solutions. they've convinced themselves digitization is a _difficult_undertaking_, and do not seem to want to be presented the reality that it is _not_...
if you want some evidence, you need look no further than this page:

> http://www.pgdp.net/w/index.php?title=Confidence_in_Page_analysis

that wikipage is accompanied by an 11-page thread in the forums:

> http://www.pgdp.net/phpBB2/viewtopic.php?p=431333#431333

they're trying to come up with a way to determine if a page is "done". they've got a lot of gobbledygook there, but not many solid results. or take a look at this 4-page thread on the simple matter of filenames:

> http://www.pgdp.net/phpBB2/viewtopic.php?t=32038

after 60 messages on this topic (not to mention _many_ threads where this topic was discussed previously), they're now thoroughly confused, and are actually heading down a path going in the wrong direction... but to see how _really_ convoluted things can get, do a forum search on "wordcheck", and check out the huge threads that were generated. they ended up with a flawed-but-acceptable checker out of that mess, but one that still doesn't respect the time and energy of the proofers... nonetheless, it's better than the on-site spellchecker used previously, so awful it didn't even have the capacity to add words to its dictionary, which meant that for many unique words in a book (like the _names_ of characters), the proofers had to see _every_occurrence_ be flagged.

***

consider all of this -- and i mean _all_ of it -- and it's easy to see why i believe that d.p. doesn't respect its volunteers sufficiently... and that's why i can no longer in good faith send people over there. in fact, i recommend that p.g. take the banner off the web-portal that recommends that visitors go over to distributed proofreaders, until d.p. cleans up its act...

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080304/24165b50/attachment-0001.htm From ebooks at ibiblio.org Tue Mar 4 13:05:54 2008 From: ebooks at ibiblio.org (Jose Menendez) Date: Tue, 04 Mar 2008 16:05:54 -0500 Subject: [gutvol-d] The Old Fashioned Way... In-Reply-To: References: Message-ID: <47CDB9B2.7010404@ibiblio.org> Bowerbird wrote: > oh gee, lookee here, _jose_menendez_ has made an appearance! > great to see you jose! even though i know you're here to razz me. If I'd wanted to "razz" you, I would have replied to one or more of your recent posts about file-naming. :) You see, now and then, I like to check on what sites link to my ebooks. Some time back, I saw that there were links from the MobileRead Forums, specifically from this post you made in a thread entitled "What 'Cleaning Up' Do Project Gutenberg Texts Need." http://www.mobileread.com/forums/showpost.php?p=112962&postcount=86 In it you linked to my Einstein, Geronimo, and Cather digital reprints. (Oddly enough, you didn't link to my reprint of Mabie's "Books and Culture," but that's irrelevant.) Here's a brief excerpt from your MobileRead post: > here's another digital reprint, this time geronimo's life story: >> http://www.ibiblio.org/ebooks/Geronimo/GerStory.pdf > compare any .pdf page with its scan by using this template: >> http://z-m-l.com/go/geron/geronp001.jpg > (as before, replace "001" with the page-number you want.) > by the way, google's scan-set from this book is the _worst_ > job of scanning a book that i have ever seen from them... > it's worth downloading just for its humor as a bad example. Your comments about the quality of Google's scan-set surprised me, because the page images I had looked at were pretty good. So I followed the link you'd given to your website: http://z-m-l.com/go/geron/geronp001.jpg Much to my surprise, I saw an image of a half-title page, which is definitely not page 1 of the book. Hmmm... 
Next I tried to look at page 145 with this URL: http://z-m-l.com/go/geron/geronp145.jpg The scan for page 119 came up instead. Uh oh! So then I tried the URL that should have shown page 119: http://z-m-l.com/go/geron/geronp119.jpg Page 99 came up in its place. Oops! So I tried this URL for page 99: http://z-m-l.com/go/geron/geronp099.jpg Page 83 came up instead. I finally did find the scan for page 145, using this URL: http://z-m-l.com/go/geron/geronp173.jpg It's a good thing you have "a tool that does this file renaming _automatically_"; otherwise, those scans might have had the wrong file names. ;) By the way, the scans on your website do look bad, but here are links to the same page scans in Google Book Search, and they look considerably better: http://books.google.com/books?id=EM6nHWWQ3TIC&pg=PA83 http://books.google.com/books?id=EM6nHWWQ3TIC&pg=PA99 http://books.google.com/books?id=EM6nHWWQ3TIC&pg=PA119 http://books.google.com/books?id=EM6nHWWQ3TIC&pg=PA145 Jose Menendez From Bowerbird at aol.com Tue Mar 4 13:48:32 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 4 Mar 2008 16:48:32 EST Subject: [gutvol-d] The Old Fashioned Way... Message-ID: i told you jose wouldn't tell me about the other errors he found. so i'll just have to see if i can lure it out of him some other time. jose said: > If I'd wanted to "razz" you, I would have replied to > one or more of your recent posts about file-naming. :) there will be more of them coming soon, so you will have additional chances to jump in on this matter if you wish... :+) but... the version of geronimo that's up on my site right now is _not_ a finished version -- precisely because the book was so badly-done that some pages are totally missing... and, if i remember correctly, other pages are duplicated, sometimes several times. like i said, worst google book i have seen yet, which is quite an accomplishment, really. i think the guy who did it must've been drunk as a skunk. 
that's what makes this book "badly-done". yes, the quality of the images that _are_ there is suitable... but what good does that do if the scan-set is incomplete? anyway, i keep checking back to see if they've redone it. or if the o.c.a. has done it. or if _anyone_ has done it... > I followed the link you'd given to your website: > http://z-m-l.com/go/geron/geronp001.jpg > Much to my surprise, I saw an image of a half-title page, > which is definitely not page 1 of the book. Hmmm... > Next I tried to look at page 145 with this URL: > http://z-m-l.com/go/geron/geronp145.jpg > The scan for page 119 came up instead. then the files are obviously using the filenames based on pagenumbers associated with them from the google .pdf, which don't account for unnumbered plates in that book... so i must've uploaded the images before i renamed them. i can correct them pretty easily, with my file-renaming tool. that's what happens with badly-named files, occasionally, is that they get put into a production stream erroneously. that's why you should give 'em the right names right away. > It's a good thing you have > "a tool that does this file renaming _automatically_"; > otherwise, those scans might have had the wrong file names. ;) yes, it _is_ a good thing i have such a tool. it's even better when i remember to use it. :+) oh, wait, you're trying to _imply_ that i don't even _have_ such a tool, aren't you? what, do you think that i would rename all these files _manually_? maybe with 2 fingers? no, let me assure you that i do indeed have such a tool. in fact, over the years, i've written many different versions of it, including the one that i started just last week, for d.p. you might remember that i offered to write one for them... they didn't accept the offer, but i wrote it anyway (big deal) and i'll be releasing it regardless. but more on that _later_. > By the way, the scans on your website do look bad i probably pulled them out of the .pdf at a low resolution. 
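the renaming fix described above -- shifting a run of scan filenames by a fixed offset so the numbers match the printed pages -- can be sketched like this. the "geronp" prefix and 3-digit zero-padded numbering are guesses at the scheme; the actual offsets caused by the unnumbered plates aren't given in this thread, so the demo values below are made up:

```python
import os, tempfile

def shift_names(directory, prefix, start, count, offset):
    """rename prefixNNN.jpg files so their numbers reflect the printed
    page, shifting scans [start, start+count) by offset. renames go
    through temporary names first, so an overlapping shift (new name ==
    some other file's old name) can't clobber a file."""
    moves = []
    for n in range(start, start + count):
        old = os.path.join(directory, f"{prefix}{n:03d}.jpg")
        new = os.path.join(directory, f"{prefix}{n + offset:03d}.jpg")
        if os.path.exists(old):
            moves.append((old, new))
    for old, new in moves:          # phase 1: move everything aside
        os.rename(old, old + ".tmp")
    for old, new in moves:          # phase 2: settle into final names
        os.rename(old + ".tmp", new)
    return moves

# demo on throwaway files
demo = tempfile.mkdtemp()
for n in (1, 2, 3):
    open(os.path.join(demo, f"geronp{n:03d}.jpg"), "w").close()
shift_names(demo, "geronp", start=1, count=3, offset=2)
print(sorted(os.listdir(demo)))
# -> ['geronp003.jpg', 'geronp004.jpg', 'geronp005.jpg']
```

a real run would need one such shift per plate-induced gap, which is why the numbers should be checked against the scans before uploading.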
my prechecking showed me that i couldn't finish the book -- i was hoping to repurpose your clean text in z.m.l. -- because it was incomplete, so i did a quickie on the scans, just so i could point people to your excellent geronimo .pdf. evidently, my "quickie" was a bit _too_ quick, if the filenames were incorrect. but nobody reported that error. until _you_. like i always say, jose, you're the best. :+)

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080304/41e899a5/attachment.htm

From ajhaines at shaw.ca Tue Mar 4 14:33:29 2008
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Tue, 04 Mar 2008 14:33:29 -0800
Subject: [gutvol-d] The Old Fashioned Way...
References:
Message-ID: <001c01c87e47$c52197b0$6401a8c0@ahainesp2400>

Several more errors, all found with one of the previously maligned Gutcheck/Jeebies/Gutspell trio:

"fromtheir" - a simple spellcheck would also have found this one.

There's a Chairman variously named Stabar, Straber, and (twice) Staber. The first two were in the same paragraph! A spellcheck would have flagged these, too.

"caroomed" - a spellcheck would have found this one, but in a case like this, where I can't decide if it's the author's intent, or a typo/spello, I leave the word alone, and add a short transcriber's note with what I think is correct, e.g. "...caroomed [Transcriber's note: caromed?] ..."

This line, cited earlier in this thread as incorrect, is definitely incorrect in its context:

> "What now?", Zolan asked.

However, given that there are too many ways in which the question/quote/comma sequence *is* correct, there's probably no way anything short of a full-blown grammar/syntax/context checker could declare a given case of the sequence correct or incorrect. Ditto for exclamation/quote/comma.
And even then, *I* wouldn't take such a utility's word for it. (Many years ago (well before Windows), I ran my autoexec.bat file through a grammar checker, and was told it was readable, but dry. Maybe such checkers are better now, but the grain-of-salt principle still applies.) My take on this submission? Given the number of errors/inconsistencies found with assorted utilities (I've lost count, but at least 6-8 items, I think, so far), I can only assume there are others, possibly findable only with a proper proof-reading. If this had been my submission, 6-8 errors is 6-8 too many, and I would not have submitted as it stands. Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Monday, March 03, 2008 12:24 PM Subject: Re: [gutvol-d] The Old Fashioned Way... oh gee, lookee here, _jose_menendez_ has made an appearance! great to see you jose! even though i know you're here to razz me. *** jose said: > Note the date of your BP post, June 30, 2005. > Now, if we look at the PG ebook, we see "Release Date: April 25, 2006." > That's nearly *ten* full months after your BP post. > It's a good thing you don't use those inefficient DP workflows > you're always criticizing. ;) notice the smiley there, folks. that means jose is "just kidding". but he has a good point. just exactly why _did_ it take so long? well, the answer to that is pretty simple. meyer was still making _changes_ to his book. like many authors, he kept rewriting it... however, _unlike_ most editors, i didn't impose a deadline on him. so that accounted for a good chunk of that time. at some point, though, he did tire of the rewriting, and "finished". of course, by that time, i had other things on my plate, so it took me a little while to get back to it. and then we did copy-editing. and then we did more copy-editing. and then we did even more. if you've ever copy-edited a "raw" book, you know it takes time... 
and then, when _that_ was done, i fully intended to demonstrate the .pdf and .html conversion possibilities, but still had to program them. i'm not one of those disciplined programmers, who can make myself code "on-demand". i have to wait for "the inspiration". and it wasn't all that forthcoming. so finally meyer wrote me, after a heart-attack, saying "i don't know if i'm long for this world; can we post my book?", so i did. thank goodness, as far as i know, he's still alive and kicking... and _that's_ why it took 10 months. actually, i would have guessed that he waited at least that long just for me to do the programming, so if that was the _total_ time, then i'm a little bit surprised...

> Well, back in late January of 2006, you asked me to check it,

because you're one of the best at finding errors, jose, and i know it. so i figured that if you couldn't find an error, then _nobody_ could...

> but I turned you down.

yeah. you never were one to do me a favor, were you? ;+)

> A quick check of a simple word frequency list
> was enough to find a mistake. For instance,

notice that "for instance", folks. translated into jose, that means "here's one of the errors i found." the _unspoken_ part of that is that he has found _more_, he just ain't gonna tell you, not yet...

> the ebook contains two occurrences of "accello-net"
> (whatever that may be) and one "accello-nets," but there's
> also one "accelo-nets." There's an "l" missing in that one.

i don't even know what an "accello-net" is. _or_ an "accelo-net". or the difference between them. or if there is a difference. :+)

> The word frequency list also revealed a number of hyphenation
> inconsistencies, for example, "interregional" vs. "inter-regional,"
> "mine-layer" vs. "minelayers," "multicolored" vs. "multi-colored," etc.

another good catch! as i said, meyer did a lot of rewriting on this. so it's obvious that i needed to do the hyphenation checks again...
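the "simple word frequency list" check described above can be sketched in a few lines. note this only catches exact hyphenated/solid pairs; inflected pairs like "mine-layer" vs. "minelayers" would need stemming on top of it:

```python
import re
from collections import Counter

def hyphenation_inconsistencies(text):
    """report words that appear both hyphenated and closed up,
    e.g. "grain-sack" vs. "grainsack", with their frequencies."""
    # hyphenated tokens are tried first, so "grain-sack" stays whole
    words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)+|[A-Za-z]+", text.lower())
    freq = Counter(words)
    report = []
    for w in freq:
        if "-" in w:
            solid = w.replace("-", "")
            if solid in freq:
                report.append((w, freq[w], solid, freq[solid]))
    return report

sample = ("The grain-sack fell. A grainsack lay there. "
          "The oil-cloth and the oilcloth. A mine-layer and some minelayers.")
print(hyphenation_inconsistencies(sample))
```

which pair is the "right" form is then a judgment call for the transcriber, per the replicate-versus-republish argument above.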
but it's not like i didn't do them a half-dozen times before that... or maybe my hyphenation checks just weren't too good back then. who knows, maybe they're not even too good _now_. or maybe so.

[snip]

_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d

From Bowerbird at aol.com Tue Mar 4 15:41:54 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 4 Mar 2008 18:41:54 EST
Subject: [gutvol-d] The Old Fashioned Way...
Message-ID:

al said:

> all found with one of the previously
> maligned Gutcheck/Jeebies/Gutspell trio:

ok, now let's not perceive "maligning" where none was done. i've already explained that my advice is based on the fact that these tools are difficult for average users to install, and to use. is there anyone who takes issue with that? because i'd be happy to point them to the forums over at d.p., where experienced users need to assist less-experienced ones, and these threads clearly indicate that these tools are not easy.
and i'm guessing they would be _quite_ formidable to users who have no digitizing experience _at_all_. (whether kevin is among those people or not, we have no real way of knowing.) besides, i'd already offered to help kevin clean up his work -- both by giving him my tools and by doing a check _myself_ -- so it's not as if i had left him out in the cold to freeze to death. why didn't anyone here offer to run "the trio" for him? > If this had been my submission, 6-8 errors is 6-8 too many, > and I would not have submitted as it stands. gee, al, you're hard-core! :+) and what does this say about the "planet strappers" test over at distributed proofreaders? the p1-p2-p3 process left 6 errors, and it looks like the 3 iterations of p1 will leave about 10 errors. this in a book that's smaller (388k) than the one i did (480k)... i'm just finishing up my post where i've written up the results, and i gave those proofers a pat on the back for a job well done. it's far more important to clean up books as errors are reported than to try to make them perfect from the outset, in my opinion. *** as for the errors in meyer's book, i'll get them cleaned very soon, and resubmit. it will give me a chance to include .html and .pdf... and _thank_you_ very much for taking a look at my work! if i can return the favor on a book of yours, let me know... (or should i instead perceive that you are "maligning" me?) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080304/135f8436/attachment-0001.htm From ajhaines at shaw.ca Tue Mar 4 16:43:57 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Tue, 04 Mar 2008 16:43:57 -0800 Subject: [gutvol-d] The Old Fashioned Way...
References: Message-ID: <000d01c87e59$ff6e0a90$6401a8c0@ahainesp2400> Granted, Gutcheck/etc take some command line know-how to get working, but it's well worth the effort, even if some hand-holding is required. I don't consider myself to be particularly hardcore, but there's no excuse for not finding and fixing such errors as were found in this particular submission. My personal standard is to submit an e-book with fewer errors in it than the original. Obviously, I'm biased, so whether that standard has been met or not I leave to others to judge. From Bowerbird at aol.com Tue Mar 4 16:49:59 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 4 Mar 2008 19:49:59 EST Subject: [gutvol-d] The Old Fashioned Way... Message-ID: about those geronimo files... i've discovered what my error was... i _should_ have pointed to _this_ directory: > http://z-m-l.com/go/gerst/ as you'll see, those files have been up since last november, and they are named _wisely_; the filenames relate to p-book pagenumbers, and unnumbered illustration plates stand out.
i discovered these _after_ i had uploaded a _new_ set of corrections to the names, now shown here: > http://z-m-l.com/go/geron/ i renamed the folder with the badly-named files: > http://z-m-l.com/go/geronbad/ oh, and josé, as for this page: > http://z-m-l.com/go/gerst/gerstp001.jpg i will routinely shuffle forward-matter pages to get the pagenumbers in sequence, as will many publishers when republishing a book... -bowerbird p.s. the google .pdf was missing some pages: > http://z-m-l.com/go/gerst/gerstp207.jpg > http://z-m-l.com/go/gerst/gerstp206.jpg > http://z-m-l.com/go/gerst/gerstp205.jpg > http://z-m-l.com/go/gerst/gerstp204.jpg > http://z-m-l.com/go/gerst/gerstp203.jpg > http://z-m-l.com/go/gerst/gerstp202.jpg > http://z-m-l.com/go/gerst/gerstp201.jpg > http://z-m-l.com/go/gerst/gerstp187.jpg > http://z-m-l.com/go/gerst/gerstp186.jpg > http://z-m-l.com/go/gerst/gerstp179.jpg > http://z-m-l.com/go/gerst/gerstp178.jpg > http://z-m-l.com/go/gerst/gerstf002.jpg From Bowerbird at aol.com Tue Mar 4 16:57:49 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 4 Mar 2008 19:57:49 EST Subject: [gutvol-d] The Old Fashioned Way... Message-ID: al said: > Granted, Gutcheck/etc take some command line know-how to get working, > but it's well worth the effort, even if some hand-holding is required. they might be good ways to find flaws in a digitization, i'd agree with that. but tools that work just as well, yet do _not_ require a complex installation, and which are easier to use (e.g., because they have a user-friendly g.u.i.) are -- in my opinion -- going to be superior, especially for the newbies...
> I don't consider myself to be particularly hardcore i want all books to be perfect. but even in criticizing d.p., i have said that a book which had _50_ errors in it was not particularly badly done. > but there's no excuse for not finding and fixing such errors > as were found in this particular submission. no excuses are offered, al. i'm just gonna go fix them... > My personal standard is to submit an e-book > with fewer errors in it than the original. this _was_ "the original", al. straight from the author's wordprocessor. and i can guarantee it was a _lot_ cleaner thanks to my helping him... -bowerbird From schultzk at uni-trier.de Wed Mar 5 02:02:32 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Wed, 5 Mar 2008 11:02:32 +0100 Subject: [gutvol-d] The Old Fashioned Way... In-Reply-To: <6d99d1fd0803031606n14fdf485g768e2eaa18f96d7@mail.gmail.com> References: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com> <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de> <6d99d1fd0803031606n14fdf485g768e2eaa18f96d7@mail.gmail.com> Message-ID: <2BCDC783-216C-4464-8242-304B3079F724@uni-trier.de> Hi David, On 04.03.2008 at 01:06, David Starner wrote: > On Mon, Mar 3, 2008 at 3:10 AM, Schultz Keith J. trier.de> wrote: >> Just followed this thread and I ask how ignorant can people >> get?? > > If you define ignorant as disagreeing with you, very. But I think > that's an overly parochial definition. No. > >> If somebody wants to contribute let them. If they want to >> do them by >> hand >> then all the more power to them. > > Each book in Project Gutenberg reflects on the quality of the whole.
> Not only that, Novel by Joe Shmoe getting posted will stop most other > people from working on it at all, which means that a poor-quality > edition will stop a high-quality edition from being posted. From my > perspective, that's motivation to encourage people to submit only > high-quality copies to PG. Here you show what I mean by ignorance: work done by hand (aka type-in) is considered per se flawed and of poor quality. That is true ignorance. I say give Kevin a chance. YOU and EVERYBODY ELSE do not know whether Kevin actually produces high quality work. > >> To that is proof enough that single persons can be proficient >> enough. > > "I know a person who is perfect at this" is hardly proof; it's barely > even an argument. We've all seen the opposite; DP is proofing the > motion picture copyright filings, and is finding that the original > typing has left several errors a page. To achieve the results that DP > is achieving, most companies have two typists independently type out > the text. Like I said above, ignorance is presuming an outcome when it cannot be determined. It is proof that it is possible to produce high quality texts without a scanner. It is not proof that everybody can. >> Please do not >> bang on those who are willing to good OLD FASHIONED HANDY >> WORK !! > > Hard work frequently isn't a substitute for using the right tools and > right knowledge. The man who picks up a hammer one day and starts > building houses for people may be altruistic, but without the right > knowledge, he's also endangering lives. You never know! Some of the world's best artists never had a formal education in art!! This is more true of many writers of the past. Yes, most who do pick up a hammer do not know what they are doing. Also, I would trust most architects to actually build my house!! > >> Well Joshua, Kevin ask if there is any reason he can offer his work >> to PG. He did not ask if it is what DP wants. > > DP doesn't want anything; the people who work with DP do.
And most of > them want PG to be the greatest it can be. Perhaps we assume, naïvely > perhaps, that other people would share that goal. It was the contributors to DP that gave ignorant advice. > >> I personally do not like DP, > > Didn't you just say >> Please do not >> bang on those who are willing to good OLD FASHIONED HANDY >> WORK !! DP is not OLD Fashioned Handy Work. Yet, I thank you for proving my point on ignorance. You call it short-sightedness. regards Keith. From schultzk at uni-trier.de Wed Mar 5 02:09:14 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Wed, 5 Mar 2008 11:09:14 +0100 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: References: Message-ID: Hi Bowerbird, Thanx for elucidating more on my point. On 04.03.2008 at 20:41, Bowerbird at aol.com wrote: > robert said: > > how will anyone be able to check his electronic text > > against the original without scans? > > the same way we check the other books without scans, > by finding a copy of the p-book or finding a scan-set... > > by the way, any progress on the process of uploading > the scans from d.p. to p.g.? c'mon folks, get that done. > if you can't do it yourselves, i'll be happy to do it for you, > working from the p.g. side, if michael and greg approve. > > *** > > keith said: > > If somebody wants to contribute let them. > > If they want to do them by hand > > then all the more power to them. > > i think the general message that he _could_ contribute > got through to kevin. (but perhaps kevin could tell us?) > > yes, the impression that he had to present a scan of the > titlepage and verso was misleading, but robert also was > fairly quick to correct himself and give kevin an option... > > people were mostly concerned about the quality issue.
> (although if he's willing to do it, he's likely not a bad typist.) > > still, it's probably good to sensitize people to that issue... From kionon at animemusicvideos.org Wed Mar 5 04:09:15 2008 From: kionon at animemusicvideos.org (Kionon) Date: Wed, 5 Mar 2008 21:09:15 +0900 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: References: Message-ID: <8893d7a30803050409q60201b05n15ebd55da6a0c7d7@mail.gmail.com> I got the message no one was going to stop me from contributing. On the other hand the list did not make a good first impression. From grythumn at gmail.com Wed Mar 5 05:18:03 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Wed, 5 Mar 2008 08:18:03 -0500 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: <8893d7a30803050409q60201b05n15ebd55da6a0c7d7@mail.gmail.com> References: <8893d7a30803050409q60201b05n15ebd55da6a0c7d7@mail.gmail.com> Message-ID: <15cfa2a50803050518r68ae820aib0d28c403a162de6@mail.gmail.com> Shrug. There are several resident trolls on the list, whom the list moderators refuse to censor; nothing most of us can do about it except block their email and try to respond to other messages in a reasonable way. What did you end up deciding about your book? You have several options open to you, if the.. varied.. responses have not made you decide not to bother. PG doesn't actually REQUIRE much, aside from some sort of proof that the book is in the public domain. I will say, from experience, that I would recommend working on something short and simple first (I didn't[1], and had to redo a lot of work before I got it right, even sending the book through DP.) and finding someone to help you with it; you never did say what the title of the book is or where (approximately) you are located. R C [1] The first book I scanned for PG (Although not the first one posted :) ) is A Rudimentary Treatise on Clocks, Watches, and Bells[2], http://www.gutenberg.org/etext/17576 [2] Or, as I like to call it, the Evil Clock Book.
On Wed, Mar 5, 2008 at 7:09 AM, Kionon wrote: > I got the message no one was going to stop me from contributing. > > On the other hand the list did not make a good first impression. From kionon at animemusicvideos.org Wed Mar 5 06:21:29 2008 From: kionon at animemusicvideos.org (Kionon) Date: Wed, 5 Mar 2008 23:21:29 +0900 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: <15cfa2a50803050518r68ae820aib0d28c403a162de6@mail.gmail.com> References: <8893d7a30803050409q60201b05n15ebd55da6a0c7d7@mail.gmail.com> <15cfa2a50803050518r68ae820aib0d28c403a162de6@mail.gmail.com> Message-ID: <8893d7a30803050621l4ae58510t38c7508358129ea8@mail.gmail.com> On 3/5/08, Robert Cicconetti wrote: > Shrug. There are several resident trolls on the list, whom the list > moderators refuse to censor; nothing most of us can do about it except block > their email and try to respond to other messages in a reasonable way. I was waiting until the squabbling ended, but since I was directly asked... > What did you end up deciding about your book? You have several options open > to you, if the.. varied.. responses have not made you decide not to bother. > PG doesn't actually REQUIRE much, aside from some sort of proof that the > book is in the public domain. Still have not made up my mind. > I will say, from experience, that I would recommend working on something > short and simple first (I didn't[1], and had to redo a lot of work before I > got it right, even sending the book through DP.) and finding someone to help > you with it; you never did say what the title of the book is or where > (approximately) you are located. I guarantee there are scans of what I wanted to do. I would be very surprised if there were not. I was planning to do some of the work by Virginia Woolf not already listed on PG.
As for my location, I'm roughly 40 minutes outside of Seoul, South Korea. You'd think I could buy a scanner at any of the dozens of electronics stores around my suburb, but so far that theory has proved false. I could get one from the center of Seoul, most notably Yongsan Electronics Market. However, I have no vehicle and that means two hours or more on the subway... carrying a scanner... Eugh. From steven at desjardins.org Wed Mar 5 09:08:18 2008 From: steven at desjardins.org (Steven desJardins) Date: Wed, 5 Mar 2008 12:08:18 -0500 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: <8893d7a30803050621l4ae58510t38c7508358129ea8@mail.gmail.com> References: <8893d7a30803050409q60201b05n15ebd55da6a0c7d7@mail.gmail.com> <15cfa2a50803050518r68ae820aib0d28c403a162de6@mail.gmail.com> <8893d7a30803050621l4ae58510t38c7508358129ea8@mail.gmail.com> Message-ID: <41fd8970803050908i69ad820crb47a3b26c74759a7@mail.gmail.com> On Wed, Mar 5, 2008 at 9:21 AM, Kionon wrote: > I guarantee there are scans of what I wanted to do. I would be very > surprised if there were not. I was planning to do some of the work by > Virginia Woolf not already listed on PG. Most of Virginia Woolf's work is not on PG because it's still under copyright in the United States. Much of her work is available from Project Gutenberg Australia. It's possible the works you're interested in have already been done. From Bowerbird at aol.com Wed Mar 5 09:27:55 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 5 Mar 2008 12:27:55 EST Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people Message-ID: robert said: > Shrug. There are several resident trolls on the list interesting interpretation, robert. i'll put my posts against yours in any test of utility... and i'm confident the future will validate my point of view. now... at any rate, as i said, kevin, i'm willing to double-check your book, and scans will make the job extremely easy. 
so i encourage you to make up your mind to go for it... there's no deeper way to interact with a book than typing it. if a book is meaningful to you, keying it in will be satisfying. -bowerbird From creeva at gmail.com Wed Mar 5 11:09:13 2008 From: creeva at gmail.com (Brent Gueth) Date: Wed, 5 Mar 2008 14:09:13 -0500 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: <8893d7a30803050621l4ae58510t38c7508358129ea8@mail.gmail.com> References: <8893d7a30803050409q60201b05n15ebd55da6a0c7d7@mail.gmail.com> <15cfa2a50803050518r68ae820aib0d28c403a162de6@mail.gmail.com> <8893d7a30803050621l4ae58510t38c7508358129ea8@mail.gmail.com> Message-ID: <2510ddab0803051109y6a89eee7k6cad5f066253dbbb@mail.gmail.com> The mailing list is something that takes time to be a bit comfortable with. I joined in 2003 and only recently have been a bit more vocal. One thing I can say, the squabbling will never end. That being said, ignore the squabbles and move on with the people that do respond positively - don't ignore a difference of opinion - you just don't have to acknowledge it. From creeva at gmail.com Wed Mar 5 11:13:04 2008 From: creeva at gmail.com (Brent Gueth) Date: Wed, 5 Mar 2008 14:13:04 -0500 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: References: Message-ID: <2510ddab0803051113hd6a3974n1b96081ae0389048@mail.gmail.com> Touchy that he assumes you - have a guilt complex? Actually I mostly agree with you bowerbird, I just hide in the foxholes more. I give opinions when I think they may make a difference - otherwise I keep my mouth shut and just read along.
I know I hold no "weight" with you guys, and that's fine - if I can contribute one little thing that is enough for me to stay reading and keep the involvement that I do. From Bowerbird at aol.com Wed Mar 5 13:14:56 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 5 Mar 2008 16:14:56 EST Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: here are the results of the "perpetual p1" test at distributed proofreaders... the experiment was geared to see if running an e-text through p1 repeatedly would produce a text that was as clean as that from the regular d.p. workflow, which consists of a p1 round, followed by a p2 round (with "better" proofers), and then a p3 round (with the "best" proofers, as tested and certified by d.p.). the first thing to note is that the proofers did an excellent job on this book...
they caught numerous errors in the original p-book, not just the o.c.r. errors. in general, they should be congratulated on their fine job of proofing here... the results clearly show that repeated p1 produces text as clean as p1-p2-p3, and calls into question whether the "better" proofers are _really_ better at all... specifically, my analysis of the results shows 274 errors to begin with... this 274 number does _not_ include the changes that proofers had to make in order to repair the 1,137 em-dashes in this book, which were accidentally changed to en-dashes by inappropriate handling by the content preparer... neither does it include corrections of the 504 ellipses throughout the book, which had to be "closed up" and/or changed (unnecessarily) to 4 dots, since the first of those two tasks could have been attained with one global change, and the second is totally uncalled for. finally, it does not include 715 end-line-hyphenates which proofers had to rejoin, under d.p. policy, which is unnecessary, since the machine can do it; nor does it include 74 changes to "clothe" em-dashes, as per d.p. policy. some of those numbers might be off slightly, but the overall thrust is clear; compared to the 274 _real_ errors in this book which _needed_ to be fixed, there were over _two_thousand_ unnecessary changes that had to be made, according to d.p. policy. roughly 8 unnecessary changes for every real one. this is why the d.p. workflow is so inefficient, and disrespectful of proofers... one other note: since the proofers did such a good job of finding errors in the original p-book, i've included all of those in this results write-up... it's worth reminding ourselves, though, that this is "outside of the scope" of what we consider the job of the proofers to actually be, so _reward_ them for going the extra mile, and don't dwell on what they "missed"... not that they missed all that much, mind you. so let's take a good look... *** so, how did p1 do the first time around?
p1 removed 205 -- 75% -- of the 274 errors. laudable performance... *** so how did the normal workflow go after this kick-off by p1? p2 found 55 of the remaining 73 errors, a rate of 75%... again, laudable. p3 found 9 of the remaining 18 errors, a "measly" 50%... not so laudable. luckily, half of the 9 errors that p3 missed were auto-detectable... *** and how did the "perpetual p1" proofings go, in comparison? iteration#2 -- the second pass of the text through the p1 experiment -- found 55 of the remaining 73 errors, _exactly_ matching the p2 results... i2 found 40 of the same errors p2 had found, and 15 that p2 had missed. (likewise, p2 found 15 errors i2 had missed.) thus, i2's accuracy was 75%. iteration#3 -- the third cycling of the text through the p1 experiment -- is finishing as i post this, but they _almost_ matched p3 _exactly_ as well; the i3 people found 8 of the 18 errors, just 1 less than the p3 proofers... but while we're noting that the i3 proofers missed some errors, to be sure, the bright spot was that i3 also _found_ 3 new errors, which is surprising, since the "marines" of the d.p. proofers -- the p3 crew -- had missed 'em. so... what's remarkable here is that the p2 and i2 figures matched _exactly_, and the p3 and i3 figures were also _almost_identical_... it's kind of freaky. thus, again, no evidence that the p1 proofers are "inferior" in any way at all. *** curiously, a good percentage of the errors that were missed by the proofers would've been _easily_ detected by any respectable post-o.c.r. clean-up tool. which means they should've been eliminated before _any_ proofing was done. (as just one example here, there was an improper period located right in the middle of a sentence, a period which was not followed by a capitalized word. that's one of the most simple, and most predictable, tests that you can make.)
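[editor's note: the "improper period" test described in the parenthesis above can be sketched in a few lines. this is an illustrative guess at such a check, not taken from any actual post-o.c.r. clean-up tool; the function name and the crude abbreviation guard are assumptions:]

```python
import re

# Sketch of the simple test described above: a period followed by a
# lowercase word usually signals a stray period in mid-sentence.
# Requiring two lowercase letters before the period is a crude,
# illustrative guard against abbreviations like "Mr." or "e.g.".
STRAY_PERIOD = re.compile(r'[a-z]{2,}\.\s+[a-z]')

def find_stray_periods(text):
    """Return character offsets of suspect mid-sentence periods."""
    return [m.start() for m in STRAY_PERIOD.finditer(text)]

# find_stray_periods("He stopped. suddenly the light failed.") flags one spot;
# find_stray_periods("He stopped. Suddenly the light failed.") flags none.
```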
and proofers might well have caught even more of the mistakes, except they were probably fatigued by all of the unnecessary changes they had to make. distributed proofreaders needs to tighten up its post-o.c.r. pre-processing. again, compared to the 274 o.c.r. errors requiring proofer action, there were over _two_thousand_ totally unnecessary changes requiring proofer action... in my opinion, that's shameful. extreme streamlining is called for, quickly! *** in sum, the text coming from 3 rounds of p1 was not significantly different from the text that was produced by the p1-p2-p3 "normal" workflow at d.p. both versions of the text had approximately 5-10 errors remaining within... for a 150-page book like this one, that is quite an acceptable rate of errors. and by doing _5_ rounds, even those 5-10 remaining errors were detected, although -- in my opinion -- it's not worth the extra work to get that level. proofers routinely spent 2-5 minutes on a page, which is a _lot_ of time... of course, even after these _5_ rounds, there might well be additional errors. indeed, i spotted 2, just by accident, in the course of conducting this review. moreover, it appears a few of the "corrections" of original p-book "errors" might have been just a touch over-zealous. (if you're curious, the "errors" on page 86 look intentional in retrospect.) that can happen sometimes... at any rate, though, proofers have done an outstanding job on this book. and the p1 proofers proved that they can keep pace with the p3 marines... -bowerbird p.s. i'll have materials documenting this analysis on my site very soon... ************** It's Tax Time! Get tips, forms, and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolprf00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080305/fdfcb1bf/attachment.htm From Bowerbird at aol.com Wed Mar 5 14:06:27 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 5 Mar 2008 17:06:27 EST Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people Message-ID: robert said: > Touchy that he assumes you - have a guilt complex? um, there's no "assumption" there at all, robert... these guys have been calling me a "troll" for years now. yet i put up meaty post after meaty post after meaty post, with zero response from them. except the ad hominem... if they had any logic to offer, they would. but they don't. so they throw in the occasional insult. best they can do... it doesn't bother me. the future will validate my input. and wonder why they had their heads up their butts... > I know I hold no "weight" with you guys _everyone_ holds weight in the marketplace of truth, robert... just make sure the tuning fork of truth hums when you speak. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080305/c6d5951d/attachment.htm From Bowerbird at aol.com Wed Mar 5 15:42:01 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 5 Mar 2008 18:42:01 EST Subject: [gutvol-d] it's comical Message-ID: it's downright _comical_ how twisted things get over at d.p. really, it's like a car full of clowns. very entertaining... :+) take a look at the filenaming thread -- already up to 4 pages: > http://www.pgdp.net/phpBB2/viewtopic.php?p=430126#430126 you might well remember that, last week or so, i offered to write a program they could use to name their files correctly. you might have also noticed that nobody accepted that offer.
thus, in that light, it's quite amusing to contrast how _quickly_ that _several_ d.p. people jumped in on a thread a while back to make the (bogus) claim that i am "unwilling to help them"... here was yet another offer to help. and again, it was ignored... just in case you're unfamiliar with it, at this point in the dance, i usually revoke my offer, and put my tool back up on the shelf. this time will be a bit different, though, for the simple reason that people who want to digitize books individually will need this tool, to save them grief from bad filenames assigned by other people... so i'm making this app generally available... *** this is a simple tool that you run against a folder full of image-files. it's built _specifically_ and _solely_ to rename the files in a scan-set. it renames sequentially-named files... for example, 001.png to 388.png might be renamed f001.png through f012.png for the forward-matter, then p001.png through p376.png for body-text. if there are unnumbered illustration plates in the middle of the book, you can specify them, and the renaming will take them into account... there's a screenshot of the interface here: > http://z-m-l.com/misc/ocr-renamer01.png you will notice that this screenshot was taken when i was renaming those "geronimo" files that jose posted a message about yesterday. you can observe that there were _unnumbered_ illustration plates after pages 18, 22, and 30, around which the renaming was done... as the screenshot shows, you can step through and _view_ files, so as to do a visible confirmation on the accuracy of the naming. this step-through capability helps ensure a scan-set is complete. and it's often extremely helpful -- even necessary -- to be able to view the files so that you can confirm what the filename should be. but very few file-renaming utilities give you the ability to do that... 
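[ed. note: the core of the renaming scheme described above -- sequential scans split into front-matter and body-text names -- is simple to express. here is a hypothetical Python sketch of the name mapping only, using the f001/p001 convention from the example; the unnumbered-plate handling of the real tool is omitted:]

```python
def plan_renames(total, front_count, ext="png"):
    # Map sequentially numbered scans (001.png .. NNN.png) to
    # fNNN.png for the front matter, then pNNN.png for the body text,
    # as in the example above. Unnumbered illustration plates, which
    # the real tool also handles, are left out of this sketch.
    plan = {}
    for i in range(1, total + 1):
        old = "%03d.%s" % (i, ext)
        if i <= front_count:
            plan[old] = "f%03d.%s" % (i, ext)
        else:
            plan[old] = "p%03d.%s" % (i - front_count, ext)
    return plan

# 388 scans with 12 pages of front matter, matching the example above:
renames = plan_renames(388, 12)
print(renames["001.png"], renames["013.png"], renames["388.png"])  # → f001.png p001.png p376.png
```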
you can use the cursor keys to step through the images, or click the left side of the image to go back, the right side to go forward. *** as a little bonus, i put in a contextual menu which allows you to make annotations on all of the pages, to be stored in a text-file; these annotations include things like "chapter heading", "greek", "equations", "italics", and so on, info that you might wanna collect about each page, a "log" to make sure pages are handled properly. (it doesn't actually _store_ that info yet; just a little feature teaser.) a screenshot showing this contextual (right-click) menu is here: > http://z-m-l.com/misc/ocr-renamer02.png the contextual menu is located off to the right... and you can see the annotations in column 11 of the listbox... i've annotated the pages very thoroughly here, because it's easy. just right-click, then select the item from the contextual menu... you can even add items to the menu on-the-fly, when necessary. in this case, i added everything below "winter", from "contents" on. this is what allowed me to create the "quotable quote" menu item, complete with the actual quote itself. (a gimmick for this demo.) if you want to check on any of those pages, use this template: > http://z-m-l.com/go/gerst/gerstp001.jpg this second screenshot also gives you a better view concerning the _previous_ filename and the _new_ name that will be given. for example, geronf002.jpg will here be renamed gerstf002.jpg. (the first screenshot has similar info, just not so nicely arranged.) *** and in case you hadn't realized, the tool will rename the *.txt file associated with each image file, so all the names will stay in sync. *** oh yeah, if you ask the tool nicely, it will not just rename the files, but also generate a batch file that you could run on the same files located on a server, to generate the same set of new names there. 
because, let's be honest, it's just silly to go through the hassle of downloading and re-uploading files just to _rename_ them. silly... *** of course, you can also create a script like my app in the first place, and then just run it on the server to begin with. but my assumption is that somebody will have all of the files on their machine, such as the post-processor or the original content provider, so the files can be given the correct set of names before they're uploaded originally. *** i'll eventually put apps for different platforms on my website, but if anyone wants it now, just backchannel me and request a copy... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080305/8addabb4/attachment-0001.htm From hyphen at hyphenologist.co.uk Thu Mar 6 01:22:30 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Thu, 6 Mar 2008 09:22:30 -0000 Subject: [gutvol-d] The Old Fashioned Way and other things. In-Reply-To: <006901c87edd$831c47c0$660fa8c0@atlanticbb.net> References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com><15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com> <001801c87c38$7b822a40$72867ec0$@co.uk> <006901c87edd$831c47c0$660fa8c0@atlanticbb.net> Message-ID: <001801c87f6b$9befee80$d3cfcb80$@co.uk> No problem, I can send the Finereader CDROM to the USA quite cheaply, just let me know your address. It is Finereader *sprint*; I am not sure what version that means. The 19th century books which I have are Yorkshire Dialect works. I have been ill for some time and hope to get back to PG work. Yes, I can borrow British Library http://www.bl.uk/ books from local libraries very cheaply.
But the fines are horrendous. The BL is now expected to make a profit, so the web site has changed and most things are charged for at an exorbitant rate. Use your academic identity when dealing with the BL. At least I can borrow anything which is at *Boston Spa* http://www.bl.uk/services/reading/bspareadingroom.html , but I can not borrow those which are at the British Library in London http://www.bl.uk/services/reading/rrhome.shtm. This is too far away, and I do *not* like London, and they are snooty about what you can do. I have not found how to determine from the catalogue what is at Boston Spa. I can drive to the Boston Spa reading room, but you have to use their copying machines, at the horrendous cost of 20 pence per photocopied A4 sheet, so going there is not worth the effort for PG work. The UK is on copyright of life plus 70, so if there is any *one* book which is out of copyright in the UK that you particularly want, let me know and I will borrow it and scan it in for you. I would not wish to get in the bad books of local libraries. If that works we can try another one. The above facilities are available to *anyone* with a local public library card in the UK, but this facility is not well known, so I have copied this to gutvol-d. Dave Fawthrop From: Norm Wolcott [mailto:nwolcott2ster at gmail.com] Sent: 05 March 2008 15:06 To: Dave Fawthrop Subject: Re: [gutvol-d] The Old Fashioned Way and other things. If you haven't found a home for your ABBYY Finereader yet, I would be glad to pay the postage for it here in the US. I have a UK bank account and can mail you a cheque for the cost. I am still lurching along on Omnipage Pro, and have almost given up on OCR since ABBYY came along and I realized I was wasting my time. On another topic, did I read in one of your earlier posts that books could be borrowed through a local library from some storehouse of books maintained in England (BM?), and that you got several 19th cent books from there?
I hope to be in the UK this spring for some Verne research, and would like to investigate this. What is the wait time, etc.? The BM does not allow taking photos etc. of books in their reading rooms, and I believe you have to use a book every day or back it goes to Northampton or whatever. Norm Wolcott -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080306/bb95e3a4/attachment.htm From Bowerbird at aol.com Fri Mar 7 10:55:24 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 7 Mar 2008 13:55:24 EST Subject: [gutvol-d] perpetual comical Message-ID: it's a three ring circus over at d.p.! clowns all over the place! :+) when i first saw that the roundless experiment over there was called "perpetual p1", i gave a little laugh, because it implied that they would just continue recycling a book through p1 over and over and over, forever. yeah, _that's_ the best way to solve your backlog problem! :+) but now it's getting scary, because i think they're really gonna do that! you might remember that i've written up the "final results" of this test... drew my conclusions and put it to bed, paperwork to be delivered soon. well, evidently, they're not quite done with it yet... they just sent it back for iteration#4. oh lordy lordy. and they've already placed iteration#5 on the docket... if we were to correct the errors that can be located _automatically_, there would be about 5 errors left after i3... or let's round it to 6... now, on their _last_ pass through p1, iteration#3, the proofers found _half_ of the remaining errors. so we can project iteration#4 will find _3_ out of the 6, leaving 3. and then iteration#5 would find 1 or 2... then iteration#6 might find 1. or maybe not. that's just a crap shoot. note that a pass through p1 burns about 8 hours in proofing time... let me tell you, the cost of finding that last error is gonna be a doozy! i hope it's worth it! but wait.
it gets even better. because i'm talking about the _real_ errors. you know, _mistakes_. but that's not all that these proofers are changing, no sir, not at all... remember those ellipses? the ones that _could_ -- and _should_ -- have been corrected in about 10 seconds, with one global change? yep, they're still plaguing this text. so let's look at the 15 pages that had "spacey ellipses" after iteration#3: pages 1, 6, 16, 29, 49, 50, 78, 80, 88, 94, 100, 103, 115, 118, and 136. now that's 15 pages right there that are going to show "diffs" in i4 -- presuming, of course, that they are actually located, and corrected... the diffs are meaningless, of course, but nonetheless there will be diffs. and if you thought _that_ was bad, well, it gets even worse... because it seems that those very same ellipses give the p1 proofers all kinds of ways to change 'em, then change 'em to something else, then change them back again. it ends up you can do this _forever_. (and "forever" and "perpetual" are kissing cousins, don't you know?) so, in i4 so far -- with some 25 pages in -- we have cases where the proofers have _missed_ ellipses that needed to be closed up... and a case where one proofer closed up _both_ sides of an ellipse. and lots of cases where a (correct) closed-up ellipse was changed to an (incorrect) spacey ellipse. not to mention several cases where a 3-dot ellipse was changed to a 4-dot one, and vice versa. sheesh! this is madness. sheer stupidity. i apologize profusely because i evidently haven't _stressed_ the fact that a roundless system needs to have some _user-training_ so this ugly circularity won't happen. but i thought that was _obvious_. does nobody watch the process? there are other meaningless changes being made too. one of them is an old standby over at d.p. -- the blank line at the top of a page... oh, and end-line-hyphenates. it's amusing to watch 'em over time. for instance, there was a case of the end-line hyphenate of grand- father. 
the first proofer came along and changed it to grand-father, simply bringing up the trailing word. the next made it grand-*father, which is d.p. code for "hey post-processor, take a close look at this". the next proofer changed it to "grandfather", which might be correct, i'm not sure, i'd have to look it up in the dictionary, which, by the way, we could've had the computer do automatically, right at the _outset_, so the joining would have been correct before any proofer ever saw it, and avoided all of this mess, and thus not wasted _any_ proofer time. because ultimately we have to go to the dictionary to decide anyway, that and look at other cases in the same book, which page-at-a-time proofers can't do. so why are proofers involved in this decision at all? again, this is supreme stupidity. stupidity piled on top of stupidity, with a little bit of incompetence thrown in. how long will it go on? well hey, this is distributed proofreaders. so it might be _perpetual_... -bowerbird p.s. so far, i4 has encountered _one_ real error. they missed it... -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080307/a048a6a9/attachment.htm From Bowerbird at aol.com Fri Mar 7 14:45:01 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 7 Mar 2008 17:45:01 EST Subject: [gutvol-d] the comedy is contagious Message-ID: so, on the afternoon of march 5th, i post a message that includes screenshots from my file-renaming tool. > http://z-m-l.com/misc/ocr-renamer01.png > http://z-m-l.com/misc/ocr-renamer02.png you know the tool... the one i offered to write for d.p., but they ignored the offer, but i'm releasing it anyway?
don't you know, but less than 24 hours later, we have: > http://www.pgdp.org/~dkretz/c/images_index.php?projectid=projectID466eb97ee3ca7 and suddenly the thread over at d.p. has a burst of clarity: > http://www.pgdp.net/phpBB2/viewtopic.php?p=433169#433169 not total clarity, mind you, far from it, but nonetheless, a significant leap from "muddled" to "on the right track". sometimes the comedy is contagious... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080307/a4537b1d/attachment.htm From Bowerbird at aol.com Fri Mar 7 16:55:05 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 7 Mar 2008 19:55:05 EST Subject: [gutvol-d] data and equations and hype heals all wounds Message-ID: this is the data that i used to prepare my "final write-up" for the "perpetual p1" experiment at distributed proofreaders... side-by-side data showing the "planet strappers" changes: > http://z-m-l.com/misc/strappers73-p1p2p3.html > http://z-m-l.com/misc/strappers73-i1i2i3.html there are a couple glitches in it, but you'll get the picture... and the picture is clear. after p1 cleared up _thousands_ of errors -- yes, _literally_ thousands -- the next round (whether re-done by p1 proofers, or by the p2 proofers) located 55 of the remaining 73 errors, for roughly 75%... the round after that, whether by p1 proofers again or p3, found about _half_ of the remaining 18 errors -- 50%... the second error exposed to iteration#4 so far was _caught_ -- yay! -- meaning they've caught 1 out of 2, or, um, 50%...
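[ed. note: the overlap figures reported in this thread admit a classic sanity check. if two independent proofings find A and B errors with C caught by both, a standard capture-recapture estimate (often attributed to Pólya) puts the total at roughly A*B/C. a hedged sketch, using the p2/i2 numbers from the earlier write-up (55 and 55 errors found, 40 in common):]

```python
def estimated_total_errors(a, b, common):
    # Capture-recapture estimate for proofreading: two independent
    # proofings find a and b errors, with `common` found by both;
    # the total error count is estimated as a*b/common.
    return a * b / common

# p2 and i2 each found 55 errors, 40 of them in common:
print(estimated_total_errors(55, 55, 40))  # → 75.625, close to the 73 actually remaining
```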
*** by the way, you might remember that i said that there was some formula -- when you had two independent proofings -- that would project the remaining number of expected errors, based on the (a) number of errors the independent proofings caught in common, and (b) the number unique to each of 'em. as i said, this formula was buried in some d.p. forum thread... that equation has now been dug up, independently. it's here: > http://mathworld.wolfram.com/ProofreadingMistakes.html *** so now, what should you do if you're being criticized for not respecting the time and energy of your volunteers? well, it's obvious, isn't it? you should put some p.r. out to tell your volunteers all the things, some of which they might not be aware of, that you've been hard at work at, to "make things better". because hype heals all wounds... > http://www.pgdp.net/phpBB2/viewtopic.php?t=32255 -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080307/93c05d25/attachment.htm From brett at dimetrodon.demon.co.uk Tue Mar 11 04:55:02 2008 From: brett at dimetrodon.demon.co.uk (Brett Paul Dunbar) Date: Tue, 11 Mar 2008 11:55:02 +0000 Subject: [gutvol-d] california international antiquarian book fair In-Reply-To: <41fd8970802191753x16b30534h70963259879e42ec@mail.gmail.com> References: <1913024193.267081203465434613.JavaMail.mail@webmail02> <41fd8970802191753x16b30534h70963259879e42ec@mail.gmail.com> Message-ID: Steven desJardins writes >On Feb 19, 2008 6:57 PM, Joshua Hutchinson wrote: >> I think you can easily make the argument that this old manuscript WAS >>published, though not mass produced. >> >> It was created by someone and sold to someone else (or perhaps >>created as a work for hire, etc).
>> >> That rule you refer to is meant to cover things like a manuscript of >>text unpublished by the author and hidden away in an attic then found >>years later when his great-granddaughter decided to clean out the old >>family junk pile. Or maybe a scientist's lab journal that was never >>meant for public consumption, but after she became famous was >>published posthumously. IMHO, of course. > >That's a reasonable argument, but the dictionaries I consulted agree >that to "publish" something is to make it generally available to the >public. I would want to see a dictionary or legal citation before >being convinced that the sale of a unique, unpublished manuscript can >constitute publication. In English law at any rate, the offence of "Publishing a Libel" can include showing a defamatory letter to your secretary even if no one else has seen it. That is one example of "publishing" being used for a document with an extremely limited circulation. In the context of defamation, publishing means voluntarily showing the document to any other person. -- Great Internet Mersenne Prime Search http://www.mersenne.org/prime.htm Livejournal http://brett-dunbar.livejournal.com/ Brett Paul Dunbar To email me, use reply-to address From prosfilaes at gmail.com Tue Mar 11 16:12:02 2008 From: prosfilaes at gmail.com (David Starner) Date: Tue, 11 Mar 2008 19:12:02 -0400 Subject: [gutvol-d] california international antiquarian book fair In-Reply-To: References: <1913024193.267081203465434613.JavaMail.mail@webmail02> <41fd8970802191753x16b30534h70963259879e42ec@mail.gmail.com> Message-ID: <6d99d1fd0803111612i13d5f779vedcc780fa84c41eb@mail.gmail.com> From , the court decision that put A Course in Miracles in the public domain. The showing of a work to a select group of people for a limited purpose (such as to seek commentary or criticism) does not constitute "publication" within the meaning of the copyright law, and is legally insufficient to place the work into the public domain. E.g., Acad.
of Motion Picture Arts and Sciences v. Creative House Promotions, Inc., 944 F.2d 1446 (9th Cir. 1991). In particular, the creator of a work has the right to show it to a limited class of people without jeopardizing the common law copyright, and, under such circumstances, the publication will be deemed "limited." Id. at 1451; Proctor & Gamble Co. v. Colgate-Palmolive Co., No. 96 Civ. 9123, 1998 WL 788802, at *38 (S.D.N.Y. 1998). Such a limited publication will be found where the publication was (1) to a definitely select group, (2) for a limited purpose, and (3) without the right of diffusion, reproduction, distribution or sale. White v. Kimmell, 193 F.2d 744, 746-47 (9th Cir. 1952); Continental Casualty Co. v. Beardsley, 253 F.2d 702, 706-07 (2d Cir. 1958), cert. denied, 358 U.S. 816 (1958); Proctor & Gamble Co., 1998 WL 788802 at *38. [...] "A general publication 'occurs when by the consent of the copyright owner, the original or tangible copies of a work are sold, leased, loaned, given away or otherwise made available to the general public, or when an authorized offer is made to dispose of the work in any such manner even if a sale or other such disposition does not in fact occur.'" Penguin Books U.S.A., 2000 WL 1028634, at *16 (citing Proctor and Gamble Co., 1998 WL 788802, at *38 (S.D.N.Y. 1998); Nimmer § 4.04 at 4-20 (3d ed. 1997)). A distribution of a work to one person constitutes a publication. Kakizaki v. Riedel, 811 F. Supp. 129, 131 (S.D.N.Y. 1992); Burke v. Nat'l Broad. Co., Inc., 598 F.2d 688, 691 (1st Cir. 1979). [...] Specifically, to satisfy that a distribution qualifies as a limited publication, the plaintiffs must sustain their burden of proof to put forth evidence that the publication was (1) to a definitely select group, (2) for a limited purpose, and (3) without the right of diffusion, reproduction, distribution or sale. [...]
A select group cannot be created by an author's "subjective 'test of cordiality.'" Thus, when works are given or sold to persons deemed "worthy" a select group is not created and the publication is not limited. When plaintiffs sell or give the Work to "congenial strangers" the Court is "unable to see in this picture any definitely selected individuals or any limited, ascertained group or class to whom the communication was restricted." Schatt v. Curtis Mgmt. Group, Inc., 764 F. Supp. 902, 911 (S.D.N.Y. 1991) (quoting White, 193 F.2d at 747). That's what publication means in a copyright sense in the US, at least according to this judge. From richfield at telkomsa.net Wed Mar 12 00:53:41 2008 From: richfield at telkomsa.net (Jon Richfield) Date: Wed, 12 Mar 2008 09:53:41 +0200 Subject: [gutvol-d] Gothic or Gothic? Message-ID: <47D78C05.5000303@telkomsa.net> I have a Canon scanner that came with an Omniscan subset, and I run them under Windows 2K. For the most part I find both of them satisfactory and sufficient, in fact, downright gratifying. I may want to convert to Linux some time, but I am too busy to sharpen my axe, so that must wait. Problem: I have a couple of books with a lot of the "old" (pre WWII mostly) German style of "Gothic" script. In particular, having holes in my head, I would love to scan in Kluge's Etymological German Dictionary as soon as I get a breather, and I might be able to get a sound copy of "Mein Kampf" as well. Unfortunately, as usual I need something free. Does anyone have any constructive suggestions, preferably for something that I can bolt onto what I have? No hurry, it won't happen this month, but if I know that I have something that works well enough to rely on, then I can scan in or photograph material against the time that I can afford to process it. BTW, just as a matter of curiosity, what is the copyright situation with Hitler's works?
I know that it has lapsed in Australia and presumably Canada, but it should nominally be in copyright in the US. Is it regarded as such, and if so, is it an academic question, or would it be enforced, and if so, by whom? FTM, who enforces copyright? Is it done automatically by any authority, or must some materially interested party sue or threaten or lay criminal charges? Thanks for your attention, Jon From steven at desjardins.org Wed Mar 12 01:44:14 2008 From: steven at desjardins.org (Steven desJardins) Date: Wed, 12 Mar 2008 04:44:14 -0400 Subject: [gutvol-d] Gothic or Gothic? In-Reply-To: <47D78C05.5000303@telkomsa.net> References: <47D78C05.5000303@telkomsa.net> Message-ID: <41fd8970803120144j1f9609c6o785714ba22188b7d@mail.gmail.com> On Wed, Mar 12, 2008 at 3:53 AM, Jon Richfield wrote: > BTW, just as a matter of curiosity, what is the copyright situation with > Hitler's works? I know that it has lapsed in Australia and presumably > Canada, but it should nominally be in copyright in the US. Is it > regarded as such, and if so, is it an academic question, or would it be > enforced, and if so , by whom? According to Wikipedia, "The U.S. government seized the copyright during the Second World War as part of the Trading with the Enemy Act and in 1979, Houghton Mifflin, the U.S. publisher of the book, bought the rights from the government." From grythumn at gmail.com Wed Mar 12 05:24:27 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Wed, 12 Mar 2008 08:24:27 -0400 Subject: [gutvol-d] Gothic or Gothic? In-Reply-To: <47D78C05.5000303@telkomsa.net> References: <47D78C05.5000303@telkomsa.net> Message-ID: <15cfa2a50803120524q5195b753t577b9d7bbdae855b@mail.gmail.com> On Wed, Mar 12, 2008 at 3:53 AM, Jon Richfield wrote: > Problem: I have a couple of books with a lot of the "old" (pre WWII > mostly) German style of "Gothic" script.
In particular, having holes in > my head, I would love to scan in Kluge's Etymological German Dictionary > as soon as I get a breather, and I might be able to get a sound copy of > "Mein Kampf" as well. Unfortunately, as usual I need something free. > Does anyone have any constructive suggestions, preferably for something > that I can bolt onto what I have? > Fraktur fonts are difficult to OCR well; I have not tried in a while, but I understand older versions of OCR software actually do better (for Finereader, it was v5 or v6; can't recall) as they make fewer assumptions about the typeface. There has also been some work done on the open-source OCR engine Tesseract by piggy, a member of DP; I have not used it myself so I cannot comment on how well it works as yet. I can say that I spent many hours trying to train FR7 to understand Fraktur and other blackletter fonts, and got absolutely nowhere. R C -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080312/3e340146/attachment.htm From piggy at netronome.com Wed Mar 12 06:09:30 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Wed, 12 Mar 2008 09:09:30 -0400 Subject: [gutvol-d] Gothic or Gothic? In-Reply-To: <47D78C05.5000303@telkomsa.net> References: <47D78C05.5000303@telkomsa.net> Message-ID: <47D7D60A.4070409@netronome.com> Jon Richfield wrote: > ... > Problem: I have a couple of books with a lot of the "old" (pre WWII > mostly) German style of "Gothic" script. In particular, having holes in > my head, I would love to scan in Kluge's Etymological German Dictionary > as soon as I get a breather, and I might be able to get a sound copy of > "Mein Kampf" as well. Unfortunately, as usual I need something free. > Does anyone have any constructive suggestions, preferably for something > that I can bolt onto what I have? > The OCR package tesseract now has usable fraktur support. You want to use the deu-f language package. 
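[ed. note: a typical command-line invocation for the Fraktur package mentioned above might look like the following. this is a sketch, not a tested recipe: it assumes tesseract is installed with the deu-f traineddata in its tessdata directory, and the filename page-042.png is only an example; exact flags can vary by tesseract version.]

```shell
# OCR one scanned page with the Fraktur-trained German data (deu-f);
# the recognized text is written to page-042.txt.
tesseract page-042.png page-042 -l deu-f
```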
If you find pages that don't OCR well, send them to me and I'll fix the tesseract training to work better with them. From piggy at netronome.com Thu Mar 13 12:43:43 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 13 Mar 2008 15:43:43 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: References: Message-ID: <47D983EF.5060002@netronome.com> Great writeup! I really appreciate the detailed data analysis. Could I trouble you to add a section to http://www.pgdp.net/wiki/Confidence_in_Page_analysis ? If that's difficult, do you mind if I include an edited form of this message? I'm also very interested in your detailed list of which errors are in which categories. I'm starting to look at finer-grained page difference metrics than wdiff alterations. Do you have a tool that makes the classifications or did you do it by hand? Bowerbird at aol.com wrote: > here are the results of the "perpetual p1" test at distributed > proofreaders... > > the experiment was geared to see if running an e-text through p1 > repeatedly > would produce a text that was as clean as that from the regular d.p. > workflow, > which consists of a p1 round, followed by a p2 round (with "better" > proofers), > and then a p3 round (with the "best" proofers, as tested and certified > by d.p.). > > the first thing to note is that the proofers did an excellent job on > this book... > they caught numerous errors in the original p-book, not just the > o.c.r. errors. > in general, they should be congratulated on their fine job of proofing > here... > > the results clearly show that repeated p1 produces text as clean as > p1-p2-p3, > and calls into question whether the "better" proofers are _really_ > better at all... > > specifically, my analysis of the results shows 274 error to begin with... 
> > this 274 number does _not_ include the changes that proofers had to make > in order to repair the 1,137 em-dashes in this book, which were > accidentally > changed to en-dashes by inappropriate handling by the content preparer... > > neither does it include corrections of the 504 ellipses throughout the > book, > which had to be "closed up" and/or changed (unnecessarily) to 4 dots, > since > the first of those two tasks could have been attained with one global > change, > and the second is totally uncalled for. > > finally, it does not include 715 end-line-hyphenates which proofers had to > rejoin, under d.p. policy, which is unnecessary, since the machine can > do it; > nor does it include 74 changes to "clothe" em-dashes, as per d.p. policy. > > some of those numbers might be off slightly, but the overall thrust is > clear; > compared to the 274 _real_ errors in this book which _needed_ to be fixed, > there were over _two_thousand_ unnecessary changes that had to be made, > according to d.p. policy. roughly 8 unnecessary changes for every > real one. > this is why the d.p. workflow is so inefficient, and disrespectful of > proofers... > > one other note: since the proofers did such a good job of finding errors > in the original p-book, i've included all of those in this results > write-up... > it's worth reminding ourselves, though, that this is "outside of the > scope" > of what we consider the job of the proofers to actually be, so _reward_ > them for going the extra mile, and don't dwell on what they "missed"... > > not that they missed all that much, mind you. so let's take a good > look... > > *** > > so, how did p1 do the first time around? > > p1 removed 205 -- 75% -- of the 274 errors. laudable performance... > > *** > > so how did the normal workflow go after this kick-off by p1? > > p2 found 55 of the remaining 73 errors, a rate of 75%... again, laudable. > > p3 found 9 of the remaining 18 errors, a "measly" 50%... not so laudable. 
> > luckily, half of the 9 errors that p3 missed were auto-detectable... > > *** > > and how did the "perpetual p1" proofings go, in comparison? > > iteration#2 -- the second pass of the text through the p1 experiment -- > found 55 of the remaining 73 errors, _exactly_ matching the p2 results... > i2 found 40 of the same errors p2 had found, and 15 that p2 had missed. > (likewise, p2 found 15 errors i2 had missed.) thus, i2's accuracy was > 75%. > > iteration#3 -- the third cycling of the text through the p1 experiment -- > is finishing as i post this, but they _almost_ matched p3 _exactly_ as > well; > the i3 people found 8 of the 18 errors, just 1 less than the p3 > proofers... > > but while we're noting that the i3 proofers missed some errors, to be > sure, > the bright spot was that i3 also _found_ 3 new errors, which is > surprising, > since the "marines" of the d.p. proofers -- the p3 crew -- had missed 'em. > > so... what's remarkable here is that the p2 and i2 figures matched > _exactly_, > and the p3 and i3 figures were also _almost_identical_... it's kind > of freaky. > > thus, again, no evidence that the p1 proofers are "inferior" in any > way at all. > > *** > > curiously, a good percentage of the errors that were missed by the > proofers > would've been _easily_ detected by any respectable post-o.c.r. > clean-up tool. > which means they should've been eliminated before _any_ proofing was done. > > (as just one example here, there was an improper period located right > in the > middle of a sentence, a period which was not followed by a capitalized > word. > that's one of the most simple, and most predictable, tests that you > can make.) > > and proofers might well have caught even more of the mistakes, except they > were probably fatigued by all of the unnecessary changes they had to make. > > distributed proofreaders needs to tighten up its post-o.c.r. > pre-processing. > > again, compared to the 274 o.c.r. 
errors requiring proofer action, > there were > over _two_thousand_ totally unnecessary changes requiring proofer > action... > in my opinion, that's shameful. extreme streamlining is called for, > quickly! > > *** > > in sum, the text coming from 3 rounds of p1 was not significantly > different > from the text that was produced by the p1-p2-p3 "normal" workflow at d.p. > both versions of the text had approximately 5-10 errors remaining > within... > for a 150-page book like this one, that is quite an acceptable rate of > errors. > > and by doing _5_ rounds, even those 5-10 remaining errors were detected, > although -- in my opinion -- it's not worth the extra work to get that > level. > proofers routinely spent 2-5 minutes on a page, which is a _lot_ of > time... > > of course, even after these _5_ rounds, there might well be additional > errors. > indeed, i spotted 2, just by accident, in the course of conducting > this review. > moreover, it appears a few of the "corrections" of original p-book > "errors" > might have been just a touch over-zealous. (if you're curious, the > "errors" > on page 86 look intentional in retrospect.) that can happen sometimes... > > at any rate, though, proofers have done an outstanding job on this book. > and the p1 proofers proved that they can keep pace with the p3 marines... > > -bowerbird > > p.s. i'll have materials documenting this analysis on my site very > soon... > > > > ************** > It's Tax Time! Get tips, forms, and advice on AOL Money & Finance. > (http://money.aol.com/tax?NCID=aolprf00030000000001) > > ------------------------------------------------------------------------ > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From piggy at netronome.com Fri Mar 14 20:03:14 2008 From: piggy at netronome.com (La Monte H.P. 
Yarroll) Date: Fri, 14 Mar 2008 23:03:14 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: References: Message-ID: <47DB3C72.8020804@netronome.com> I've put the numerical content of this posting into the CiP wiki page: http://www.pgdp.net/wiki/Confidence_in_Page_analysis#Detailed_analysis_of_PP1.2C_I1-I3 Bowerbird at aol.com wrote: > here are the results of the "perpetual p1" test at distributed > proofreaders... From hyphen at hyphenologist.co.uk Sun Mar 16 20:08:16 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Mon, 17 Mar 2008 03:08:16 -0000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47D983EF.5060002@netronome.com> References: <47D983EF.5060002@netronome.com> Message-ID: <000301c887dc$284710c0$78d53240$@co.uk> I am an ex-engineer, where attempts at perfection are treated with derision. *Everything* is produced to a standard *good enough to do the job*; *everything* has a tolerance attached to it: something, say, 12 inches long may have a tolerance of 1/10,000 of an inch or indeed 1/8 of an inch, depending on its proposed use. Anyone who produced a drawing asking for perfection (*dead flat and real smooth*, as the saying went) never heard the last of it. Has anyone done a cost/benefit analysis on second and third rounds of proofing? Is a book with 18 errors any easier to read than one with 75 errors? There are most certainly errors in the same ballpark as 18 to 75 in the original printed texts of the books I read or make into e-text. Why try for perfection when the paper original is far from perfect? Would any ordinary reader notice that level of errors? Different editions of old hand-composed books have different errors. I personally just ignore errors in paper or e-books. Before anyone mentions academics who must have everything perfect: what proportion of our readers are academics or terminal pedants? 
Why pander to the whims of an unrepresentative sample of readers? Is not 75 errors per book good enough for a Science Fiction novel? Dave Fawthrop -----Original Message----- Bowerbird at aol.com wrote: > here are the results of the "perpetual p1" test at distributed > proofreaders... _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d From klofstrom at gmail.com Sun Mar 16 20:31:25 2008 From: klofstrom at gmail.com (Karen Lofstrom) Date: Sun, 16 Mar 2008 17:31:25 -1000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000301c887dc$284710c0$78d53240$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> Message-ID: <1e8e65080803162031t598a25eewb380cfb3c340cf7d@mail.gmail.com> On Sun, Mar 16, 2008 at 5:08 PM, Dave Fawthrop wrote: > I am an ex-engineer, where attempts at perfection are treated with derision. > What proportion of our readers are academics or terminal pedants? Why pander to the whims of an unrepresentative sample of readers? Dave, I'm a terminal pedant. 
A lot of us at DP are. Furthermore, the books we're processing are, many of them, destined only to be used by academics and pedants. Few people read Rosa Nouchette Carey for fun; if they do, they're either pedants like me or academics working on a book or paper. Scholars want and need accuracy in texts. Over the centuries, many person-hours have been devoted to making sure that editions are the best possible. Academics spot what seem to be errors in texts, propose emendations, and then argue about the emendations. Applying engineering standards to text is like applying lit crit to engineering. It's a category mistake. Sure, you may not care if your 1930s SF has errors, or if the abominable typesetting in the original has been emended, but an academic writing a history of SF wants good texts. Since we pedants and academics gravitate towards DP, and want to see our work USED, we are working towards academic standards. If you want to start a rival book digitization project whose motto is "Good enough, I guess," go right ahead :) Readers will download the versions that they like, as long as the versions are clearly labeled. -- Zora, a pedant From steven at desjardins.org Sun Mar 16 21:22:47 2008 From: steven at desjardins.org (Steven desJardins) Date: Mon, 17 Mar 2008 00:22:47 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000301c887dc$284710c0$78d53240$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> Message-ID: <41fd8970803162122s65497ddcl915be8a889fc9353@mail.gmail.com> On Sun, Mar 16, 2008 at 11:08 PM, Dave Fawthrop wrote: > Why try for perfection when the paper original is far from perfect? > Would any ordinary reader notice that level of errors? > Different editions of old hand composed books have different errors. > I personally just ignore errors in paper or e-books. I try for perfection because _somebody_ should. 
I produce what may become the definitive e-text version of any particular work, read potentially by tens or hundreds of thousands of people per year. To suggest that it's too much trouble to carefully examine the book four times for defects seems preposterous. Even if it makes only a small difference to each of those readers, a small benefit multiplied by ten thousand justifies a great deal of care. > Is not 75 errors per book good enough for a Science Fiction novel? Not if I'm responsible for it, it isn't. From Morasch at aol.com Sun Mar 16 23:00:03 2008 From: Morasch at aol.com (Morasch at aol.com) Date: Mon, 17 Mar 2008 02:00:03 EDT Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: dave said: > I am an ex-engineer, where attempts at perfection are treated with derision. > *Everything* is produced to a standard *good enough to do the job*, > *everything* has a tolerance attached to it good point, dave. except it's already been covered, at least by me. lots of times. i've mentioned -- probably dozens of times by now -- that _my_ standard for moving text to the general public for a "continuous proofreading" stage is 1-error-in-every-10-pages. the general public will help us from then on. i've also mentioned that my tools generally produce much higher accuracy, up to something around the rate of 1-error-in-every-100-pages, meaning we can certainly start from a high point before we involve the general public. but even more important than the level of accuracy we attain, i believe, is the attitude and the infrastructure that we provide to correct any mistakes. 
if we make it clear to the general public that these e-books _belong_ to them, and that they are responsible for _reporting_ any errors they find in the text, and then we make it _easy_ for them to actually check text against the scans, and we make it _easy_ for them to _report_ errors, and then we make it _easy_ for them to see that all the error-reports are acted upon extremely _quickly_, then we'll have greased the skids for the e-texts to move toward perfection... currently, p.g. isn't good at doing _any_ of those things. not a one. sadly. maybe 10 years ago, that was ok. but in the wake of wikipedia, it's _not_ ok. wikipedia has shown people how _collective_responsibility_ works with text. *** steve said: > read potentially by tens or hundreds of thousands of people per year. that cuts both ways. if an e-text really has that many readers, and they feel a sense of _ownership_ of the e-text, then _they_ will help us find the errors. the problem is, we're not imbuing them with that sense of ownership. and that failure has _lots_ of implications, ranging far beyond errors... > To suggest that it's too much trouble to carefully examine the book > four times for defects seems preposterous. in one sense, yes. but in another sense, which is equally valid, all of the time and energy _unnecessarily_ spent on one book is time and energy that _could_ have been spent digitizing another. so the _correct_ answer is to spend the _proper_ time and energy on each book. but that's a rather more difficult thing to calculate. i don't think it's _impossible_ to compute it, not by any means at all. but i do know for sure your "four times" answer is not the right way; for some pages, sure. but for all of them?, no way jose... > Even if it makes only a small difference to each of those readers, > a small benefit multiplied by ten thousand justifies a great deal of care. you're badly overestimating your own importance in the overall equation... 
the proofing/digitization process is _not_ completed when a book is posted. it has only _begun_. and your inability to see that is what causes the problem. you are _not_ the final line. errors can be corrected long after your input... (and it's a good thing you're not the final line, because you're not as good as you think you are, not if you're the average distributed proofreader person...) *** dave said: > Has anyone done a cost/benefit analysis > on second and third rounds of proofing? o.c.r. alone -- when well-done -- can get many pages _completely_ correct. if spell-check reveals zero _flags_ (where "flags" refers to any words that are (1) not in the spell-check dictionary and (2) are low-frequency in the book), then my sense is that that page could be passed through without a check... for pages with a couple flags, i'd suggest those flags be scrutinized closely... on pages with many flags, thorough proofing of the entire page is called for. any and all changes made to a page should be verified by a second person... as long as a page has _any_ errors on it, you should assume there are more, meaning specifically (over and above verification) it should be checked again. once a page has been "certified" as "clean" without any changes made to it, i'd consider it to be "clean enough". if you want a higher degree of certainty, you could require a second "certification" without any changes being made. anything over and above that would have little claim on a cost-benefit basis. decisions made on the basis of "rounds" are fatally flawed from the outset, since some of the pages will always be easy, and some will always be hard. > Is not 75 errors per book good enough for a Science Fiction novel? for a 750-page book, maybe. but even then, i'd think we can do better. -bowerbird p.s. i see in my spam folder that zora (klofstrom) has weighed in on this thread. i would guess she's making some "d.p. has high standards of accuracy" comment. 
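The flag-based triage rule described in the post above (pass a page with zero spell-check flags, scrutinize just the flags when there are only a couple, and proof the whole page when there are many, where a "flag" is a word that is both absent from the dictionary and low-frequency in the book) is mechanical enough to prototype in a few lines. The sketch below is illustrative only, not an actual DP or bowerbird tool; the tokenizer, thresholds, and routing labels are all assumptions:

```python
import re
from collections import Counter

def tokenize(text):
    """Split a page into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def triage_pages(pages, dictionary, rare_threshold=2):
    """Route each page by its count of "flags": words that are
    (1) not in the spell-check dictionary and
    (2) low-frequency across the whole book."""
    book_freq = Counter(w for page in pages for w in tokenize(page))
    routes = []
    for page in pages:
        flags = sorted(w for w in set(tokenize(page))
                       if w not in dictionary and book_freq[w] <= rare_threshold)
        if not flags:
            route = "pass"        # zero flags: page may need no human check
        elif len(flags) <= 2:
            route = "spot-check"  # a couple of flags: scrutinize just those
        else:
            route = "full-proof"  # many flags: proof the entire page
        routes.append((route, flags))
    return routes

pages = ["The quick brown fox.", "Tbe quick hrown fox junped high."]
dictionary = {"the", "quick", "brown", "fox", "high"}
for route, flags in triage_pages(pages, dictionary):
    print(route, flags)
```

On real OCR output the dictionary would be a full wordlist and the thresholds would need tuning; the point is only that the zero-flag/few-flags/many-flags routing can be computed before any proofer sees the page.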
meanwhile, i examined one of her submissions and found _dozens_ of errors in it, all of which i documented on this list years ago; yet they still haven't been corrected. so let us remember that some of the people bellowing the loudest about the "quality" of their efforts -- in an effort to try to tell you what to do -- are doing slipshod work... -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080317/00e05a98/attachment-0001.htm From Bowerbird at aol.com Sun Mar 16 23:03:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 17 Mar 2008 02:03:03 EDT Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: sending again, from the proper account. sorry to confuse you... :+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080317/456b577e/attachment.htm From hyphen at hyphenologist.co.uk Mon Mar 17 00:39:39 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Mon, 17 Mar 2008 07:39:39 -0000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <41fd8970803162122s65497ddcl915be8a889fc9353@mail.gmail.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <41fd8970803162122s65497ddcl915be8a889fc9353@mail.gmail.com> Message-ID: <000f01c88802$145228e0$3cf67aa0$@co.uk> Then you are admittedly a terminal pedant and spend IMO too much time and effort on proofreading. Dave Fawthrop On Sun, Mar 16, 2008 at 11:08 PM, Dave Fawthrop wrote: > Why try for perfection when the paper original is far from perfect? > Would any ordinary reader notice that level of errors? > Different editions of old hand composed books have different errors. > I personally just ignore errors in paper or e-books. I try for perfection because _somebody_ should. 
I produce what may become the definitive e-text version of any particular work, read potentially by tens or hundreds of thousands of people per year. To suggest that it's too much trouble to carefully examine the book four times for defects seems preposterous. Even if it makes only a small difference to each of those readers, a small benefit multiplied by ten thousand justifies a great deal of care. > Is not 75 errors per book good enough for a Science Fiction novel? Not if I'm responsible for it, it isn't. From steven at desjardins.org Mon Mar 17 00:50:14 2008 From: steven at desjardins.org (Steven desJardins) Date: Mon, 17 Mar 2008 03:50:14 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000f01c88802$145228e0$3cf67aa0$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <41fd8970803162122s65497ddcl915be8a889fc9353@mail.gmail.com> <000f01c88802$145228e0$3cf67aa0$@co.uk> Message-ID: <41fd8970803170050y7c9fbb58l1ad0b0d9314daa7e@mail.gmail.com> On Mon, Mar 17, 2008 at 3:39 AM, Dave Fawthrop wrote: > Then you are admittedly a terminal pedant and spend IMO too much > time and effort on proofreading. I don't think you know what the word "admittedly" means, since I've made no claims of pedantry, only of aspirations towards accuracy. (But I could be wrong; maybe you don't know what a "pedant" is.) In my opinion, the time and effort I spend on proofreading is well spent. Since it's my time and my effort, I think my opinion counts somewhat more highly than yours. From ralf at ark.in-berlin.de Mon Mar 17 01:32:47 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 17 Mar 2008 09:32:47 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: References: Message-ID: <20080317083247.GA5920@ark.in-berlin.de> Morasch wrote > wikipedia has shown people how _collective_responsibility_ works with text. What nonsense. 
Wikisource texts have more errors than even pre-10k PG eTexts. That's also because the interface for working with BOTH text and page image is so bad. Additionally, they have only two rounds of proofreading, compared to SEVEN at DP. ralf From bzg at altern.org Mon Mar 17 01:44:31 2008 From: bzg at altern.org (Bastien Guerry) Date: Mon, 17 Mar 2008 08:44:31 +0000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <1e8e65080803162031t598a25eewb380cfb3c340cf7d@mail.gmail.com> (Karen Lofstrom's message of "Sun, 16 Mar 2008 17:31:25 -1000") References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <1e8e65080803162031t598a25eewb380cfb3c340cf7d@mail.gmail.com> Message-ID: <87eja9n3io.fsf@bzg.ath.cx> "Karen Lofstrom" writes: > Applying engineering standards to text is like applying lit crit to > engineering. It's a category mistake. No. You are confusing "text" with the process DP goes through when producing text. Applying engineering standards to this process sounds perfectly reasonable to me. (Ryle would be scared at how his concept of "category mistake" is now so mainstream that people are abusing it. That was for the pedantic note.) As far as I know, thinking in terms of "good enough" doesn't prevent anyone from trying to improve a system -- yes, even stupid engineers want to improve machines! And the will to improve something always calls for a direction. So "good enough" qualifies what DP does today, and stating this doesn't prevent people from trying to improve it. 
-- Bastien From traverso at posso.dm.unipi.it Mon Mar 17 02:09:31 2008 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Mon, 17 Mar 2008 10:09:31 +0100 (CET) Subject: [gutvol-d] ***SPAM*** Re: a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000301c887dc$284710c0$78d53240$@co.uk> (hyphen@hyphenologist.co.uk) References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> Message-ID: <20080317090931.14A8D93B61@posso.dm.unipi.it> >>>>> "Dave" == Dave Fawthrop writes: Dave> Is not 75 errors per book good enough for a Science Fiction Dave> novel? Only if the novel is 750 pages; if it is 150 pages, I will not go on reading for pleasure after page 10; I would rather read something else. Too many errors make reading annoying. In any case I believe that a further proofreading that can bring the errors down to 15, or one every 50 pages, is well spent. I agree that many old books are much worse than an error every 10 pages, and also agree that, for easy reading, some kinds of harder-to-spot transcription errors do not matter much. And agree that a cost/benefit analysis could be beneficial, but PG has to make its own analysis, and every volunteer can make his own too. I believe that if the whitewashers suspect that the error ratio exceeds one error every two pages the submission is resent to the contributor without much thinking. Carlo From julio.reis at tintazul.com.pt Mon Mar 17 03:36:40 2008 From: julio.reis at tintazul.com.pt (Júlio Reis) Date: Mon, 17 Mar 2008 10:36:40 +0000 Subject: [gutvol-d] gutvol-d Digest, Vol 44, Issue 19 In-Reply-To: References: Message-ID: <1205750200.7554.53.camel@abetarda.mshome.net> > Is not 75 errors per book good enough for a Science Fiction novel? Of course it is, fellow engineer! Because your SF novel is 1,500 pages long, right? So, one mistake every 20 pages, that's very good. Unless you *can* find out those 75 errors, in which case... go and get them.
Producing accurate texts is great, as long as it doesn't get to the point of faithfully reproducing the errors found in the original. It's all right to correct those and leave transcriber's notes to that effect. That said... the amount of effort put in *must* be measured and feel "decent" in proportion to the benefit derived. How many days would it take to find those 75 errors? Some here would say no one's rushing you, but that's beside the point IMHO. You *need* to finish stuff to get on with the next project. So, accurate yes. Quick, if possible. From walter.van.holst at xs4all.nl Mon Mar 17 05:36:11 2008 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Mon, 17 Mar 2008 13:36:11 +0100 Subject: [gutvol-d] OCR OS Xquestion Message-ID: <47DE65BB.1040402@xs4all.nl> L.S., I will be asking the same question on the DP-fora, what OCR software would one recommend on Mac OS X? Is IRIS any good? Regards, Walter From joshua at hutchinson.net Mon Mar 17 05:51:49 2008 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Mon, 17 Mar 2008 12:51:49 +0000 (GMT) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: <1751743406.11431205758310116.JavaMail.mail@webmail02> Oh hell no. You did NOT just label PP and PPV at DP as proofreading rounds. Believe me, PP and PPV are not and have never been meant to be true proofreading rounds (and the reason we left the old 2 round system was to get away from the necessity of proofreading in PP and PPV). I now return you to the age old argument over quality vs quantity. Josh On Mar 17, 2008, ralf at ark.in-berlin.de wrote: Morasch wrote > wikipedia has shown people how _collective_responsibility_ works with text. What nonsense. Wikisource texts have more errors than even pre-10k PG eTexts. That's also because the interface for working with BOTH text and page image is so bad. Additionally, they have only two rounds of proofreading, compared to SEVEN at DP. 
ralf _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d From piggy at netronome.com Mon Mar 17 06:57:27 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 17 Mar 2008 09:57:27 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000301c887dc$284710c0$78d53240$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> Message-ID: <47DE78C7.60700@netronome.com> Dave Fawthrop wrote: > ... > Anyone who produced a drawing asking for perfection, the saying was > *dead flat and real smooth*, never heard the last of it. > > Has anyone done a cost/benefit analysis on second and third rounds of > proofing?... > That is a significant part of what I am trying to do. Right now I am attempting to build the apparatus to estimate the cost part. The primary unit of cost is human lifetime, i.e. proofer-time. To see my current thinking, I would encourage folks to read http://www.pgdp.net/wiki/Confidence_in_Page_analysis#The_Ferguson-Hardwick_Algorithm . The benefit side is a little harder. I think I have a handle on calculating effectiveness--the number and possible kinds of errors we remove. But actually calculating a cost to undiscovered misprints is much more difficult. We need to be able to compare the cost of undiscovered misprints with the cost of doing the work. Can anybody think of a way of quantifying the cost of undiscovered misprints in terms of human lifetime? Clearly there is a significant difference between Rosa Nouchette Carey and Raymond Zinke Gallun. I think Mr. Gallun himself would have agreed. Where can we get data to differentiate these two cases? To those seeking perfection: I'm sorry, but the tools to confirm perfection don't exist. Our current understanding of the universe limits us to deciding how close to perfection we probably are. 
What is the best way to spend our lives? I can't claim to offer a general solution, but I'm trying my darndest to offer a quantitative recommendation on how to spend that portion of your life you see fit to dedicate to proofing books at PGDP. From piggy at netronome.com Mon Mar 17 07:04:08 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 17 Mar 2008 10:04:08 -0400 Subject: [gutvol-d] gutvol-d Digest, Vol 44, Issue 19 In-Reply-To: <1205750200.7554.53.camel@abetarda.mshome.net> References: <1205750200.7554.53.camel@abetarda.mshome.net> Message-ID: <47DE7A58.5080601@netronome.com> Júlio Reis wrote: >> Is not 75 errors per book good enough for a Science Fiction novel? >> > ... > That said... the amount of effort put in *must* be measured and feel > "decent" in proportion to the benefit derived. How many days would it > take to find those 75 errors? Some here would say no one's rushing you, > but that's beside the point IMHO. You *need* to finish stuff to get on > with the next project. > > So, accurate yes. Quick, if possible. > In the US, we have until 2019 to catch up. Congress did us the "favor" of freezing the Public Domain. From hyphen at hyphenologist.co.uk Mon Mar 17 08:56:34 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Mon, 17 Mar 2008 15:56:34 -0000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47DE78C7.60700@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> Message-ID: <000901c88847$7a8066f0$6f8134d0$@co.uk> La Monte H.P. Yarroll wrote Dave Fawthrop wrote: >> ... >> Anyone who produced a drawing asking for perfection, the saying was >> *dead flat and real smooth*, never heard the last of it. >> >> Has anyone done a cost/benefit analysis on second and third rounds of >> proofing?... >> >That is a significant part of what I am trying to do.
>Right now I am attempting to build the apparatus to estimate the cost >part. The primary unit of cost is human lifetime, i.e. proofer-time. In the UK we have a national minimum wage which is now slightly more than GBP 5 per hour. Being retired, I expect someone somewhere to get 5 GBP for every hour which I spend doing voluntary work. Dave Fawthrop From hyphen at hyphenologist.co.uk Mon Mar 17 09:00:04 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Mon, 17 Mar 2008 16:00:04 -0000 Subject: [gutvol-d] OCR OS Xquestion In-Reply-To: <47DE65BB.1040402@xs4all.nl> References: <47DE65BB.1040402@xs4all.nl> Message-ID: <000a01c88847$f7c445f0$e74cd1d0$@co.uk> Walter van Holst wrote >L.S., >I will be asking the same question on the DP-fora, what OCR software >would one recommend on Mac OS X? Is IRIS any good? I ditched IRIS on a PC in favour of Abbyy FineReader, which is IME much better. Dave Fawthrop From walter.van.holst at xs4all.nl Mon Mar 17 09:24:34 2008 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Mon, 17 Mar 2008 17:24:34 +0100 Subject: [gutvol-d] OCR OS Xquestion In-Reply-To: <000a01c88847$f7c445f0$e74cd1d0$@co.uk> References: <47DE65BB.1040402@xs4all.nl> <000a01c88847$f7c445f0$e74cd1d0$@co.uk> Message-ID: <47DE9B42.2070504@xs4all.nl> Dave Fawthrop wrote: >> I will be asking the same question on the DP-fora, what OCR software >> would one recommend on Mac OS X? Is IRIS any good? > > I ditched IRIS on a PC in favour of Abbyy finereader which is IME much > better. There is no Abbyy FineReader edition available for OS X anymore, therefore my question. Regards, Walter From piggy at netronome.com Mon Mar 17 10:02:43 2008 From: piggy at netronome.com (La Monte H.P.
Yarroll) Date: Mon, 17 Mar 2008 13:02:43 -0400 Subject: [gutvol-d] OCR OS Xquestion In-Reply-To: <47DE9B42.2070504@xs4all.nl> References: <47DE65BB.1040402@xs4all.nl> <000a01c88847$f7c445f0$e74cd1d0$@co.uk> <47DE9B42.2070504@xs4all.nl> Message-ID: <47DEA433.4000204@netronome.com> Walter van Holst wrote: > Dave Fawthrop wrote: > > >>> I will be asking the same question on the DP-fora, what OCR software >>> would one recommend on Mac OS X? Is IRIS any good? >>> >> I ditched IRIS on a PC in favour of Abbyy finereader which is IME much >> better. >> > > There is no Abby Finereader edition available for OS X anymore, > therefore my question. > > Regards, > > Walter > Has anybody tried tesseract OCR on Mac OS X? It's not an officially supported platform, but Wikipedia claims folks have used it successfully. From Bowerbird at aol.com Mon Mar 17 10:10:04 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 17 Mar 2008 13:10:04 EDT Subject: [gutvol-d] OCR OS Xquestion Message-ID: iris is a waste of money _and_ time. -bowerbird ************** It's Tax Time! Get tips, forms, and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolprf00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080317/0f8de921/attachment.htm From steven at desjardins.org Mon Mar 17 10:12:13 2008 From: steven at desjardins.org (Steven desJardins) Date: Mon, 17 Mar 2008 13:12:13 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000901c88847$7a8066f0$6f8134d0$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> Message-ID: <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> On Mon, Mar 17, 2008 at 11:56 AM, Dave Fawthrop wrote: > In the UK we have a national minimum wage which is now slightly more > than GBP 5 per hour. > > Being retired, I expect someone somewhere to get 5 GBP for every hour > Which I spend doing voluntary work. So if a basic proofreading job (P1) is worth, say, 10p. to the average reader, and takes 10 hours of volunteer time, you would say that it's justified if the resulting e-book has 500 readers. And if a better proofreading job (P2) were worth only an additional 1p., and took 15 hours, the e-book would need 7500 readers for you to consider it justified. And if a thoroughly nitpicky proofreading job (P3 and PP) were worth only 0.1p. and took 30 additional hours, the e-book would need 150,000 readers. I think a good proofreading job is worth more than that, not necessarily to the median reader (who, like you, may be indifferent), but to a minority who care enough about quality, and who place a high enough value on e-books, to radically boost the mean. But even using a very low valuation, I don't see how you can justify leaving 75 errors in a book; are you really suggesting that it's worth less than 1p. per reader to put the novel at least through P2?
(I don't agree with your metric, by the way--I think reducing every activity to economic value is tunnel-visioned and unsophisticated--but it seems more to undermine your position than to support it.) From piggy at netronome.com Mon Mar 17 10:18:39 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 17 Mar 2008 13:18:39 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000901c88847$7a8066f0$6f8134d0$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> Message-ID: <47DEA7EF.5000706@netronome.com> Dave Fawthrop wrote: > La Monte H.P. Yarroll wrote > > Dave Fawthrop wrote: > >>> ... >>> Anyone who produced a drawing asking for perfection, the saying was >>> *dead flat and real smooth*, never heard the last of it. >>> >>> Has anyone done a cost/benefit analysis on second and third rounds of >>> proofing?... >>> >>> > > >> That is a significant part of what I am trying to do. >> > > >> Right now I am attempting to build the apparatus to estimate the cost >> part. The primary unit of cost is human lifetime, i.e. proofer-time. >> > > In the UK we have a national minimum wage which is now slightly more > than GBP 5 per hour. > > Being retired, I expect someone somewhere to get 5 GBP for every hour > Which I spend doing voluntary work. > > Dave Fawthrop > "Cost functions" in statistics are rarely denominated in currency. In this case, there is no money changing hands, so there is little value in converting time to currency. We want "cost functions" so that we can answer the question "Is the expected result of another round of proofreading worth the effort it will take?" 
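desJardins' break-even figures a few messages above can be checked mechanically. The sketch below (the pence-per-reader valuations, hour counts, and GBP 5/hour wage are the hypothetical figures quoted in the thread, not measured costs) reproduces his 500 / 7,500 / 150,000 reader thresholds:

```python
# Break-even readership per proofreading round, using the thread's
# hypothetical figures: volunteer time priced at GBP 5/hour and a
# guessed per-reader value for each round's marginal improvement.
rounds = [
    # (round name, value to one reader in pence, volunteer hours)
    ("P1",      10.0, 10),
    ("P2",       1.0, 15),
    ("P3 + PP",  0.1, 30),
]

WAGE_PENCE_PER_HOUR = 500  # GBP 5/hour expressed in pence

for name, pence_per_reader, hours in rounds:
    cost_pence = hours * WAGE_PENCE_PER_HOUR
    readers_needed = round(cost_pence / pence_per_reader)
    print(f"{name}: {readers_needed} readers to break even")
# Prints 500 for P1, 7500 for P2, and 150000 for P3 + PP,
# matching the figures in desJardins' message.
```

The point of the exercise is only that the thresholds grow very fast as the marginal value of a round shrinks; the per-reader valuations themselves remain the contested part of the argument.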
From Bowerbird at aol.com Mon Mar 17 10:51:39 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 17 Mar 2008 13:51:39 EDT Subject: [gutvol-d] 75 errors in a book -- putting things back into perspective Message-ID: ok, let's put things back into perspective, ok? i'm happy to see the d.p. people are getting all righteous about dave's suggestion that 75 errors in a book might be acceptable. now, how about showing some similar outrage over the fact that -- in the "perpetual p1" book -- your proofers had to find and fix _one_thousand_one_hundred_and_thirty_seven_ (1,137) em-dash errors _introduced_ into the document by an incompetent person? where is the big huff that proofers had to do _over_seven_hundred_ (700+) unnecessary "corrections" to rejoin end-of-line hyphenates, which could have been done by the computer instead, in seconds? and why is there no complaining about the fact that the proofers had to "clothe" 75 end-of-line em-dashes, for no good reason? and what about the 500+ ellipses that required "corrections" too? those could have been handled easily by automated routines too. these are the _real_ problems with the workflow at d.p.! _8_ unnecessary fixes required for every _1_ o.c.r. error! if you're going to consider the "cost" of doing a round of proofing, then separate out the cost of these _needless_ changes beforehand. once you've done that, you'll see the cost of _correcting_the_o.c.r._ is -- in a relative sense -- a very small cost indeed... the solution is clear -- distributed proofreaders, fix your workflow! -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080317/4a420954/attachment.htm From marcello at perathoner.de Mon Mar 17 10:57:41 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 17 Mar 2008 18:57:41 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47DE78C7.60700@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> Message-ID: <47DEB115.604@perathoner.de> La Monte H.P. Yarroll wrote: > What is the best way to spend our lives? I can't claim to offer a > general solution, but I'm trying my darndest to offer a quantitative > recommendation on how to spend that portion of your life you see fit to > dedicate to proofing books at PGDP. Has anybody yet come up with the revolutionary idea that people might proofread books because they have fun? -- Marcello Perathoner webmaster at gutenberg.org From ralf at ark.in-berlin.de Mon Mar 17 11:11:55 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 17 Mar 2008 19:11:55 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <1751743406.11431205758310116.JavaMail.mail@webmail02> References: <1751743406.11431205758310116.JavaMail.mail@webmail02> Message-ID: <20080317181155.GB7041@ark.in-berlin.de> > Believe me, PP and PPV are not and have never been meant to be true proofreading rounds (and the reason we left the old 2 round system was to get away from the necessity of proofreading in PP and PPV). Josh, I know why you say this. Because people should concentrate on the specialty of each round. But you know quite well that you as PP are responsible for anything that gets through. Also, you can't help reading while working on it, and you can't help correcting when something hits your eyes. 
ralf From Bowerbird at aol.com Mon Mar 17 12:17:38 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 17 Mar 2008 15:17:38 EDT Subject: [gutvol-d] "perpetual" documentation of the errors found post-p1 Message-ID: here are the latest reports on the "perpetual p1" experiment: > http://z-m-l.com/misc/strappers77-p1p2p3.html > http://z-m-l.com/misc/strappers77-i1i2i3i4.html p1 corrected about 222 o.c.r. errors -- plus thousands of other "errors". p1 missed 77 errors discovered by later proofers, as documented below. p2 fixed 55 errors; there were 15 unique ones and 40 in common with i2. likewise, i2 fixed 55 errors -- 15 unique ones and 40 in common with p2. taken together, then, p2 and i2 found 70 of the 77 errors which they faced. p3 found 10 of the 22 errors with which it was faced, finding 3 unique ones. i3 found 8 of the 22 errors with which it was faced, also finding 3 unique ones. p3 and i3 left 5 undetectable o.c.r. errors each; they were completely different. taken together, p3 and i3 found 6 unique errors, and left only 1 undiscovered... i4 found 7 of the 14 errors with which it was faced, finding the last unique one. (since that last error was a misspelled word, spellcheck also would've caught it.) yes, that's right, 6 rounds of proofers missed a word that didn't pass spellcheck! of the 7 errors missed by i4, only one of them was an undetectable o.c.r. error, and the p2 proofers had caught that one in their round... a complete list of the ~300 o.c.r. errors will be compiled and posted soon, and a clean version of the text posted. after that, i'll probably put this baby to bed... -bowerbird p.s. as for meaningless changes, i4 missed 12 of the 15 pages with spacey ellipses: 1 no 6 no 16 yes! 29 no 49 yes! 50 yes! 78 no 80 no 88 no 94 no 100 no 103 no 115 no 118 no 136 no
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080317/3ba7eae7/attachment.htm From prosfilaes at gmail.com Mon Mar 17 15:34:45 2008 From: prosfilaes at gmail.com (David Starner) Date: Mon, 17 Mar 2008 18:34:45 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000301c887dc$284710c0$78d53240$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> Message-ID: <6d99d1fd0803171534o5e421447l47be9ec8c31f7a6@mail.gmail.com> On Sun, Mar 16, 2008 at 11:08 PM, Dave Fawthrop wrote: > I am an ex-engineer, where attempts at perfection are treated with derision. Really? As far as I can tell, Boeing works pretty hard to make sure that _no_ planes fall out of the sky. Nuclear power plants are designed so that _none_ of them blow up. A perfect etext is possible and could be achieved. > Is a book with 18 errors any easier to read than one with 75 errors? Of course. The fewer times I get yanked out of the story by a typo, the better. It's unacceptable for me to have to deal with a typo that obscures what the original said. > Before anyone mentions academics who must have everything perfect. > What proportion of our readers are academics or terminal pedants? > Why pander to the whims an unrepresentative sample of readers? I don't think readers who object to typos are all that unrepresentative a sample of readers. Furthermore, especially in a volunteer organization, it's a reasonable goal to turn out material that will make the producers happy, even if the audience would settle for something lesser. From piggy at netronome.com Mon Mar 17 20:01:43 2008 From: piggy at netronome.com (La Monte H.P.
Yarroll) Date: Mon, 17 Mar 2008 23:01:43 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47DEB115.604@perathoner.de> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <47DEB115.604@perathoner.de> Message-ID: <47DF3097.2000501@netronome.com> Marcello Perathoner wrote: > La Monte H.P. Yarroll wrote: > > >> What is the best way to spend our lives? I can't claim to offer a >> general solution, but I'm trying my darndest to offer a quantitative >> recommendation on how to spend that portion of your life you see fit to >> dedicate to proofing books at PGDP. >> > > Has anybody yet come up with the revolutionary idea that people might > proofread books because they have fun? > Absolutely! If we could denominate the time spent proofing in units of fun and escaped misprints in units of un-fun, we'd have a viable pair of cost functions that would help us decide when to stop proofreading a particular book. I for one get a small kick out of finding an obscure misprint, but a much larger kick from seeing a book I worked on posted to PG. If I could quantify these "kicks" I'd also have a good start on the cost functions I'm looking for. From piggy at netronome.com Mon Mar 17 20:12:17 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 17 Mar 2008 23:12:17 -0400 Subject: [gutvol-d] "perpetual" documentation of the errors found post-p1 In-Reply-To: References: Message-ID: <47DF3311.3080902@netronome.com> Bowerbird at aol.com wrote: > ,,, > p.s. as for meaningless changes, i4 missed 12 of the 15 pages with > spacey ellipses: > 1 no > 6 no > 16 yes! > 29 no > 49 yes! > 50 yes! > 78 no > 80 no > 88 no > 94 no > 100 no > 103 no > 115 no > 118 no > 136 no Don't hold it against them--after about page 80 I added the following note to the project guidelines: Please ignore ellipses. Leave them as they currently stand. They will be handled in PP. 
Your analysis suggests that people are reading the guidelines. From piggy at netronome.com Mon Mar 17 20:41:33 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 17 Mar 2008 23:41:33 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47DF3097.2000501@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <47DEB115.604@perathoner.de> <47DF3097.2000501@netronome.com> Message-ID: <47DF39ED.4050103@netronome.com> La Monte H.P. Yarroll wrote: > Marcello Perathoner wrote: > >> La Monte H.P. Yarroll wrote: >> >> >> >>> What is the best way to spend our lives? I can't claim to offer a >>> general solution, but I'm trying my darndest to offer a quantitative >>> recommendation on how to spend that portion of your life you see fit to >>> dedicate to proofing books at PGDP. >>> >>> >> Has anybody yet come up with the revolutionary idea that people might >> proofread books because they have fun? >> >> > > Absolutely! If we could denominate the time spent proofing in units of > fun and escaped misprints in units of un-fun, we'd have a viable pair of > cost functions that would help us decide when to stop proofreading a > particular book. > Oops. That's not quite right. Surely it's more fun to proof different books than to proof the same book over and over. How much less fun is it to proof the same book over one more time? When that fun drops below the sum total of unfun we would get from the likely number of remaining errors, then we are done. Different people enjoy specific kinds of books. This suggests that a fun metric would be proofer and book specific. Could we ask people how much fun they had proofing a particular page? Could we get a complementary rating from people who find errors in posted PG texts? Another possibility is to equate attention with fun. 
If people stop paying attention to a book, we could presume that they no longer find it fun. The amount of daily attention a book gets could be compared to the mean for all projects. Diminishing attention could be treated as diminishing fun, i.e. rising cost. This could conceivably let us ignore the cost of missed misprints completely. I really like the idea of trying to optimize PGDP for fun. More suggestions are solicited! From piggy at netronome.com Mon Mar 17 20:54:26 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 17 Mar 2008 23:54:26 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> Message-ID: <47DF3CF2.6040103@netronome.com> Steven desJardins wrote: > ... > I think a good proofreading job is worth more than that, not > necessarily to the median reader (who, like you, may be indifferent), > but to a minority who does care highly enough about quality, and who > place a high enough value on e-books, to radically boost the mean. But > even using a very low valuation, I don't see how you can justify > leaving 75 errors in a book; are you really suggesting that it's worth > less than 1p. per reader to put the novel at least through P2? > ... Thanks, that reminds me of a fallacy I've been meaning to point out to folks. The number of final errors in a book is not predominantly a function of the last round it finishes. The initial number of errors in the book is very important. There are a handful of books which come out of P1 with phenomenally low error rates. There are also a handful of books which finish P3 with error rates comparable to really poor OCR.
Much more important than finishing a certain number of rounds is to actually predict the likely number of remaining errors in a specific text (which we can do with moderate reliability) and then decide which kind of round to subject it to. Examine these pictures to visualize the problem: http://www.pgdp.net/wiki/Confidence_in_Page_analysis#Changes_III From Bowerbird at aol.com Mon Mar 17 22:18:52 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 18 Mar 2008 01:18:52 EDT Subject: [gutvol-d] the o.c.r. errors in the "perpetual p1" experiment Message-ID: here are the o.c.r. errors in the "perpetual p1" experiment: > http://z-m-l.com/misc/strapper269errors.html 269 lines were changed from the o.c.r. to the "final" version. note that these are o.c.r. errors _only_. there's no computation of em-dashes, clothing of em-dashes, rejoined end-of-line hyphenates, ellipses, or asterisks (notes). in other words, assume capable handling by the content provider, and an intelligent workflow... if we assume about 10% of these lines contain more than 1 error, then we've got about 300 separate errors here. counting the 77 errors that were found after that initial p1 round, that means p1 fixed around 225 of the 300, giving an accuracy rate of 75%. the p2/i2 rounds, which each caught 55 of the remaining 77 errors, thus had an accuracy rate of 70%. most of the rounds after that had an accuracy rate around 50%... the i2/i3/i4 "iterations" did _not_ have the benefit of wordcheck -- the "good" and "bad" word-lists were not maintained for them -- which makes their ability to keep pace with p2/p3 more remarkable. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/a95eba28/attachment-0001.htm From hyphen at hyphenologist.co.uk Tue Mar 18 00:22:52 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Tue, 18 Mar 2008 07:22:52 -0000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <6d99d1fd0803171534o5e421447l47be9ec8c31f7a6@mail.gmail.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <6d99d1fd0803171534o5e421447l47be9ec8c31f7a6@mail.gmail.com> Message-ID: <000001c888c8$e345ac50$a9d104f0$@co.uk> David Starner wrote On Sun, Mar 16, 2008 at 11:08 PM, Dave Fawthrop wrote: >> I am an ex-engineer, where attempts at perfection are treated with derision. >Really? As far as I can tell, Boeing works pretty hard to make sure >that _no_ planes fall out of the sky. As an example, aircraft engines fail on occasion, which is why passenger aircraft have at least two engines; when one fails, the plane will still get down safely without that engine. Landing with a dead engine is far from perfection. I can only remember one case where two engines failed at the same time. http://images.cnn.com/2008/WORLD/europe/01/18/heathrow.incident/index.html We do not yet know why. >Nuclear power plants are >designed so that _none_ of them blow up. Three Mile Island, Chernobyl, and Windscale, for example.
Dave Fawthrop From traverso at posso.dm.unipi.it Tue Mar 18 00:31:36 2008 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Tue, 18 Mar 2008 08:31:36 +0100 (CET) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000001c888c8$e345ac50$a9d104f0$@co.uk> (hyphen@hyphenologist.co.uk) References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <6d99d1fd0803171534o5e421447l47be9ec8c31f7a6@mail.gmail.com> <000001c888c8$e345ac50$a9d104f0$@co.uk> Message-ID: <20080318073136.3AC5E93B61@posso.dm.unipi.it> >>>>> "Dave" == Dave Fawthrop writes: Dave> David Starner wrote Dave> On Sun, Mar 16, 2008 at 11:08 PM, Dave Fawthrop Dave> wrote: >>> I am an ex-engineer, where attempts at perfection are treated >>> with Dave> derision. >> Really? As far as I can tell, Boeing works pretty hard to make >> sure that _no_ planes fall out of the sky. Dave> As an example Aircraft Engines fail occasion, which is why Dave> passenger aircraft have at least two engines, when one fails Dave> the planes will still get down safely without that engine. Dave> Landing with a dead engine is far from perfection. And for the same reason there are several rounds of proofreading. Proposing to post books with 75 errors in 150 pages is like proposing to have aircraft that crash when an engine fails. It can be done, but nobody is willing to use them. Carlo From Bowerbird at aol.com Tue Mar 18 01:51:29 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 18 Mar 2008 04:51:29 EDT Subject: [gutvol-d] fully informed of the d.p. follies Message-ID: carlo said: > Proposing to post books with 75 errors in 150 pages is like > proposing to have aircrafts that crash when an engine fails. > It can be done, but nobody is willing to use them. notice how willingly the d.p. people jump on the false issues, and contrast that with their silence when it comes to the _real_ ones...
so far i've done them the favor of making my points on this list, one that is hidden to google's robots behind a subscriber wall... but starting this spring -- which begins this friday, i believe -- i'll be reposting my messages on a blog that google will crawl, so the world finally becomes fully informed of the d.p. follies... just to remind them, off the dome, this is what needs to be done: 1. ensure you have decent scans, and name them intelligently. 2. use a decent o.c.r. program, and ensure quality results. 3. do not tolerate bad text handling by content providers. 4. do a decent post-o.c.r. cleanup, before _any_ proofing. 5. retain linebreaks (don't rejoin hyphenates or clothe em-dashes). 6. change the ridiculous ellipse policy to something sensible. 7. stop doing small-cap markup with no semantic meaning. 8. i forget what 8 was for. 9. retain pagenumber information, in an unobtrusive manner. 10. format the ascii version using light markup, for auto-html. -bowerbird ************** It's Tax Time! Get tips, forms, and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolprf00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/2292eebd/attachment.htm From schultzk at uni-trier.de Tue Mar 18 03:25:04 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Tue, 18 Mar 2008 11:25:04 +0100 Subject: [gutvol-d] OCR OS Xquestion In-Reply-To: <47DE65BB.1040402@xs4all.nl> References: <47DE65BB.1040402@xs4all.nl> Message-ID: <841F5114-875F-41D3-B3D1-D7FA85748475@uni-trier.de> Hi Walter, I use OmniPage. not perfect, but does the job for me. Keith. Am 17.03.2008 um 13:36 schrieb Walter van Holst: > L.S., > > I will be asking the same question on the DP-fora, what OCR software > would one recommend on Mac OS X? Is IRIS any good? 
> > Regards, > > Walter > From prosfilaes at gmail.com Tue Mar 18 04:41:07 2008 From: prosfilaes at gmail.com (David Starner) Date: Tue, 18 Mar 2008 07:41:07 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000001c888c8$e345ac50$a9d104f0$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <6d99d1fd0803171534o5e421447l47be9ec8c31f7a6@mail.gmail.com> <000001c888c8$e345ac50$a9d104f0$@co.uk> Message-ID: <6d99d1fd0803180441n22eab75ew7bc12678e97af3d6@mail.gmail.com> On Tue, Mar 18, 2008 at 3:22 AM, Dave Fawthrop wrote: > David Starner wrote > > On Sun, Mar 16, 2008 at 11:08 PM, Dave Fawthrop > wrote: > >> I am an ex-engineer, where attempts at perfection are treated with > derision. > > >Really? As far as I can tell, Boeing works pretty hard to make sure > >that _no_ planes fall out of the sky. > > As an example, aircraft engines fail occasionally, which is why passenger > aircraft > have at least two engines; when one fails the plane will still get down > safely > without that engine. Landing with a dead engine is far from perfection. The whole thing about engineering standards is that you specify clearly what you're trying to achieve. If the goal is to have no planes fall out of the sky, which would be considered underspecified, then landing with a dead engine is meeting that goal. > I can only remember one case where two engines failed at the same time. The Gimli Glider is another example. The goal, however, is just that: a goal. No matter what your goals are, and how hard you try to meet them, sometimes you'll fail; you'll get a wrench that doesn't meet your 1/8" specification and somehow slipped past the checks. That doesn't mean you give up the goals. > Three Mile Island, Chernobyl, Windscale for example. Two of which didn't blow up. Only SL-1 and Chernobyl have accidentally blown up.
(For research reasons, a couple early reactors were sent critical at the end of the lifespan, just to see what would happen.) From piggy at netronome.com Tue Mar 18 06:22:32 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Tue, 18 Mar 2008 09:22:32 -0400 Subject: [gutvol-d] the o.c.r. errors in the "perpetual p1" experiment In-Reply-To: References: Message-ID: <47DFC218.8000409@netronome.com> Bowerbird at aol.com wrote: > here are the o.c.r. errors in the "perpetual p1" experiment: > > http://z-m-l.com/misc/strapper269errors.html > > 269 lines were changed from the o.c.r. to the "final" version. May I have permission to copy your detailed list into the PGDP wiki? I'd like a single archival location for all the data related to this experiment. I will be using your list of "real" changes to test an automated difference tool to see if it can approximate the accuracy of your manual analysis. You mentioned earlier a handful of defects found in the original text. I'm very interested in seeing those too. Again, I really appreciate the energy you are putting into this experiment. From Bowerbird at aol.com Tue Mar 18 11:03:17 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 18 Mar 2008 14:03:17 EDT Subject: [gutvol-d] the o.c.r. errors in the "perpetual p1" experiment Message-ID: piggy said: > May I have permission to copy your detailed list into the PGDP wiki? again, facts is facts. use my stuff... you never need to ask permission... heck, do you think marcello asked my "permission" when he assembled his "fan site" for me? and he didn't even maintain the proper context... > I will be using your list of "real" changes > to test an automated difference tool > to see if it can approximate > the accuracy of your manual analysis. um, my analysis isn't "manual" by a long shot. just back from vacation, so i haven't written it up fully, but i will get around to doing that soon... the bottom line is pretty simple, though. if d.p. 
would simply ditch all the unnecessary changes you have the proofers make, it will be dirt-simple for you to get analyses like the ones i've provided here. > You mentioned earlier a handful of defects found in the original text. > I'm very interested in seeing those too. i can't place that. i'd need a more solid reminder of what i said. > Again, I really appreciate the energy you are putting into this experiment. real-world data is fun, in general. this project is kind of a drag, because it puts a magnifying glass on d.p. dysfunctionality, and it would sure be nice if _sometime_ i could talk about what people are doing _right_ instead of doing _wrong_, plus this project has the additional burden of me knowing that juliet will almost certainly fail to act appropriately on the conclusions, but again, i love to explore real-world data, even if the results are obvious. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/6dc40491/attachment.htm From marcello at perathoner.de Tue Mar 18 12:25:13 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 18 Mar 2008 20:25:13 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47DF3CF2.6040103@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> Message-ID: <47E01719.9000307@perathoner.de> La Monte H.P.
Yarroll wrote: > Much more important than finishing a certain number of rounds is to > actually predict the likely number of remaining errors in a specific > text (which we can do with moderate reliability) and then decide which > kind of round to subject it to. Why would the "likely number of remaining errors" be a better estimator for which round to send the text to, than the number of errors found in the last round? The set of errors in a text is recursively enumerable, meaning there is no way to know if you already found them all. Meaning also, you cannot verify your predictions. You will know if your predictions were too low, but never if they were too high. My advice is to just stick to the number of errors found. Do a thing like this: let z << y << x - the text goes to round 1 if there is no preceding round - the text is done if the preceding round finds less than z errors - the text goes to round 3 if the preceding round finds more than z and less than y errors - the text goes to round 2 if the preceding round finds more than y and less than x errors - else the text goes to round 1 -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Tue Mar 18 13:07:36 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 18 Mar 2008 16:07:36 EDT Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: ok, i dug out what i'd written up on doing my analysis. i write quick-and-dirty tools to do _most_ of my work, because that offers me the flexibility i typically need... i haven't yet tried to program an application that can handle an arbitrary o.c.r. output file dropped upon it. i've done enough hacking to know it _can_ be written, and fully expect to write it within the next year or two, but it'll take serious testing to make it robust enough, and then further refinement to make it user-friendly... alas, such a tool would realistically be fine-tuned for o.c.a.
output, and until they clean up their o.c.r. act -- they have quotemark, em-dash, and pagebreak issues -- there is no sense in refining things prematurely... so it's easier for me right now to just hack what i need. *** piggy said: > Do you have a tool that makes the classifications > or did you do it by hand? a little bit of both. actually, a whole _lot_ of both. for my money, line-based stuff is the only way to go. first of all, lines are a fairly good count on actual diffs, since the most common line will only contain 1 error... second, a line gives sufficient context to grok the error. third, and perhaps most important of all, lines are easy. at least lines are _usually_ quite easy to handle, _except_ when it comes to d.p. content. the d.p. workflow calls for unnecessary and extensive reworking of the line-endings -- rejoining end-line hyphenates, clothing hyphens, etc. -- so massaging d.p. content _back_ to the p-book linebreaks is the most painful and labor-intensive part of the process... that's why i've recommended before -- and do so again -- that you _not_ have proofers do those unnecessary changes. i have written routines to help with the massaging, but i also end up doing lots of it -- more than i would like -- manually. however, once the linebreaks are normalized between files -- or if they had never been subjected to such distortion -- it's a simple matter to pull out the lines that have differences: just store the lines of each file in arrays, and compare them... treatment of differing lines can also isolate their difference, for useful categorization (e.g., incorrect letter, joined words, improper casing, incorrect punctuation, and various others), and even automated corrections. (this strategy is best used when you resolve two separate digitizations, or you compare two rounds of parallel proofing.) 
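The paired-line approach described here can be sketched in a few lines of Python. A minimal illustration (the category names, toy lines, and helper names are all illustrative, not bowerbird's actual tooling), assuming both files already share the same linebreaks:

```python
# Sketch of the line-based comparison described above: store each
# file's lines in arrays, pull out the pairs that differ, and put a
# rough category on each difference. The categories are illustrative
# guesses at the "incorrect letter / joined words / improper casing /
# incorrect punctuation" buckets, not the real tool.
import string

def differing_pairs(lines_a, lines_b):
    """Return (index, line_a, line_b) for every pair of lines that differs."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(lines_a, lines_b)) if a != b]

def classify(a, b):
    """Crude difference category for one pair of differing lines."""
    strip = lambda s: s.translate(str.maketrans("", "", string.punctuation))
    if strip(a) == strip(b):
        return "punctuation"
    if a.lower() == b.lower():
        return "casing"
    wa, wb = a.split(), b.split()
    if len(wa) != len(wb):
        return "joined-or-split words"
    changed = [(x, y) for x, y in zip(wa, wb) if x != y]
    if len(changed) == 1:
        return "word"  # e.g. an incorrect letter in one word
    return "other"

ocr   = ["It was a dark and", "stormy night. the rain", "fell in tonents."]
final = ["It was a dark and", "stormy night. The rain", "fell in torrents."]

for i, a, b in differing_pairs(ocr, final):
    print(i, classify(a, b), repr(a), "->", repr(b))
```

The same differing-pairs list can then drive the spellcheck- and punctuation-based auto-corrections the message goes on to describe.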
for instance, if the words of the two lines are all identical, except for one pair of them, and one exception-word is found in a spellcheck dictionary while the other is not, you'd auto-change the one that's not. or, if there's a comma-period difference, and the following word is uncapitalized, you'd change the period to a comma. i've used this line-based method of presentation for years, as seen in "revolutionary o.c.r. proofing" on the d.p. forum: > http://www.pgdp.net/phpBB2/viewtopic.php?t=24008 some other examples are here: > http://z-m-l.com/go/oneoo/webone.html > http://z-m-l.com/go/oneoo/weball.html i've also used this paired-line format as the input-format for machine-executed corrections, wherein that paired-line file then becomes the _change-log_ that reflects the corrections, but we're probably getting a little too far afield with _that_... it _is_ important to understand, however, that what i'm doing carves a very large swath in terms of what _needs_ to be done across the complete range of the electronic-library workflow, and isn't just geared to the performance of this solitary task... there is a certain kind of bad myopia over at d.p. that when an e-text gets posted to p.g., it's "finished". but from _my_ perspective, that represents the _beginning_ of its lifespan. my modus operandi is to think in terms of a whole library... -bowerbird p.s. it's not necessary to use my name over on the d.p. wiki, piggy. it's a generous, civil gesture on your part, to be sure, but some of "the powers that be" (as they call themselves), like juliet and donovan, will reject _anything_ from me out of hand. so there's really no need to put yourself under that handicap... i certainly don't need -- won't even claim -- any of the "credit" if distributed proofreaders suddenly does straighten out its act.
(http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/53411c9e/attachment.htm From jeroen.mailinglist at bohol.ph Tue Mar 18 13:11:34 2008 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Tue, 18 Mar 2008 21:11:34 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <20080317181155.GB7041@ark.in-berlin.de> References: <1751743406.11431205758310116.JavaMail.mail@webmail02> <20080317181155.GB7041@ark.in-berlin.de> Message-ID: <47E021F6.40208@bohol.ph> When I do PP, I have a kind of overview of word-statistics and background information that not all Proofers will have access to. Because of that, I often catch things that I can never blame the proofers for not catching, such as inconsistencies in the original. These do get fixed (with transcriber notes) before I post a file to PG. In almost every book, I also catch a few things that the proofers ought to catch, but I hardly complain. I would let through more mistakes myself... Over 300 books PP-ed. Jeroen. Ralf Stephan wrote: >> Believe me, PP and PPV are not and have never been meant to be true proofreading rounds (and the reason we left the old 2 round system was to get away from the necessity of proofreading in PP and PPV). >> > > Josh, I know why you say this. Because people should concentrate > on the specialty of each round. But you know quite well that you > as PP are responsible for anything that gets through. Also, you > can't help reading while working on it, and you can't help correcting > when something hits your eyes. > -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/c27ec68b/attachment.htm From Bowerbird at aol.com Tue Mar 18 13:31:20 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 18 Mar 2008 16:31:20 EDT Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: jeroen said: > When I do PP, I have a kind of overview of word-statistics and > background information that not all Proofers will have access to. i don't see a reason those "statistics and information" should be kept from the proofers. for instance, i've tested (and enjoyed!) a page display where every word is _colorized_ independently... the higher the frequency of the word, the lighter it became, so very common words like "and" and "the" were practically white. words with just one occurrence in the book were _pure_black_. low-frequency words which weren't in the dictionary were red. inconsistent hyphenation, spelling, and so on were turned blue. won't work fully for color-blind people, but they are kinda rare. as i said, i enjoyed this interface, a lot, and i felt it was effective... i will definitely be incorporating some aspect of it in future work. -bowerbird p.s. i've also found it workable to have an interface where you can right-click on a word and show a contextual menu that lists all the lines in the rest of the book that contain that word. _very_ useful... ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/50aa711f/attachment.htm From piggy at netronome.com Tue Mar 18 13:50:24 2008 From: piggy at netronome.com (La Monte H.P. 
Yarroll) Date: Tue, 18 Mar 2008 16:50:24 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47E01719.9000307@perathoner.de> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> <47E01719.9000307@perathoner.de> Message-ID: <47E02B10.7090203@netronome.com> Marcello Perathoner wrote: > La Monte H.P. Yarroll wrote: > > >> Much more important than finishing a certain number of rounds is to >> actually predict the likely number of remaining errors in a specific >> text (which we can do with moderate reliability) and then decide which >> kind of round to subject it to. >> > > Why would the "likely number of remaining errors" be a better estimator > for which round to send the text to, than the number of errors found in > the last round? > Someone reading a text does not care how many errors were found in the last round of proofreading. They care about the number remaining. Actually, the number of errors found in the last round appears to be a pretty good predictor for the number of remaining errors, so the distinction is not terribly critical. The relationship is not linear, but it has a high correlation.
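Marcello's threshold rule from earlier in the thread routes on exactly this last-round error count. A sketch (the concrete threshold values are placeholders; the thread never quantifies z, y, and x):

```python
# Marcello's round-routing rule, sketched as a function. The threshold
# values are invented placeholders satisfying z << y << x; the thread
# leaves them unquantified.
Z, Y, X = 2, 10, 50

def next_round(errors_found_last_round=None):
    """Decide where a text goes next from the last round's error count.
    None means the text has not been through any round yet; 0 means done."""
    e = errors_found_last_round
    if e is None:
        return 1   # no preceding round: start at round 1
    if e < Z:
        return 0   # done
    if e < Y:
        return 3   # only a few errors left: send to the careful round
    if e < X:
        return 2
    return 1       # still very noisy: back to round 1

print(next_round(None), next_round(1), next_round(5), next_round(20), next_round(99))
```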
Guiness does not need to taste every bottle of brew to have a high confidence that they are keeping their quality standards. > My advise is to just stick to the number of errors found. Do a thing > like this: let z << y << x > Would you care to quantify these thresholds? > - the text goes to round 1 if there is no preceding round > > - the text is done if the preceding round finds > less than z errors > > - the text goes to round 3 if the preceding round finds > more than z and less than y errors > > - the text goes to round 2 if the preceding round finds > more than y and less than x errors > > - else the text goes to round 1 > This algorithm only works if all three resources are completely fungable, or equivalently, they are perfectly balanced. In practice, deciding to put a page (or a whole project) through P3 reduces the amount of proofer-time available for P1. The result is bottlenecking, a problem we are seeing now. I highly recommend reading http://www.pgdp.net/wiki/Confidence_in_Page_analysis#The_Ferguson-Hardwick_Algorithm . The core problem is to devise the two cost functions C_k (cost of a round), and c_k (cost of a missed misprint). If C_k exceeds E[c_k] (the cost of expected errors left before applying another round of proofing) then you are done. In our case, we have a family of C_k functions, one for each kind of round. It is also import to understand p, the probability of finding a particular error, and lambda, the rate at which particular errors occur in the text. Note that neither p nor lambda are constant for our data. Some errors are more common than others (different lambdas), and some are harder to find than others (different ps). 
From joshua at hutchinson.net Tue Mar 18 13:54:53 2008 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Tue, 18 Mar 2008 20:54:53 +0000 (GMT) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: <1253528362.2961205873693940.JavaMail.mail@webmail02> I'm not involved in this project, but my outsider's understanding is that they are trying to do just that. The values for X, Y, and Z are what they are trying to find the optimal values for. Hey, nothing like a bunch of geeks on the Internet obsessing over numbers, right? :) (No offense intended ... I consider "geek" a badge of honor) Josh On Mar 18, 2008, marcello at perathoner.de wrote: My advice is to just stick to the number of errors found. Do a thing like this: let z << y << x - the text goes to round 1 if there is no preceding round - the text is done if the preceding round finds less than z errors - the text goes to round 3 if the preceding round finds more than z and less than y errors - the text goes to round 2 if the preceding round finds more than y and less than x errors - else the text goes to round 1 From creeva at gmail.com Tue Mar 18 14:06:09 2008 From: creeva at gmail.com (Brent Gueth) Date: Tue, 18 Mar 2008 17:06:09 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <1253528362.2961205873693940.JavaMail.mail@webmail02> References: <1253528362.2961205873693940.JavaMail.mail@webmail02> Message-ID: <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> You know, on this thread I completely agree with a lot of the ideas here and LOVE the idea of the color-based word-frequency proofreading system. That being said, one thing I noticed being mentioned was correction even if the source material included an error. If you're doing a study on classic sci-fi and how it appeared printed in issue X of generic magazine - shouldn't any errors included in the original printing be maintained verbatim?
If you're fixing possible spelling mistakes at first, what about the trend later to fix grammar mistakes, so essentially PG is going to become the editor to fix things that the original may have honestly put in there intentionally. I have heard that some publishers or authors put in an occasional mistake on purpose to verify if anyone else copied their work. Granted this is more likely to happen in a public domain anthology, but at the same time something may come across as intentional. What about when Twain writes in a dialect - granted we know what the words should be, but at the same time any proofreader would see this as a misspelling. I'm sure that the proofreaders are doing the best they can - but in the end are we looking to end all errors - or are we looking to make sure that the finished text is 100% accurate to the source text? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/e0f88fee/attachment.htm From klofstrom at gmail.com Tue Mar 18 15:06:47 2008 From: klofstrom at gmail.com (Karen Lofstrom) Date: Tue, 18 Mar 2008 12:06:47 -1000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> References: <1253528362.2961205873693940.JavaMail.mail@webmail02> <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> Message-ID: <1e8e65080803181506q11f4fd8eiad7d41d2e0efc29c@mail.gmail.com> On Tue, Mar 18, 2008 at 11:06 AM, Brent Gueth wrote: > I'm sure that the proofreaders are doing the best they can - but in the end are we looking to end all errors - or are we looking to make sure that the finished text is 100% accurate to the source text? The purpose of the transcriber's notes is to alert readers to changes made in the text, correcting typos in the original.
I don't think that DP has always been as careful as it is now to note corrections, but I believe that the current state of affairs honors both reading ease (not being pulled up short by an obvious typo) and accuracy to the original. I believe that this is one virtue of TEI, ne? It has a protocol for noting original and emendation IN the flow of the text, so that presumably you could write viewing software that would display both. -- Karen Lofstrom From Bowerbird at aol.com Tue Mar 18 15:15:47 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 18 Mar 2008 18:15:47 EDT Subject: [gutvol-d] let's make it simple, ok? Message-ID: let's make it simple, ok? the discussion is -- or should be -- how to make a roundless system work. specifically, the question is "when should we consider a page to be _done_?" the answer is really very simple. if the last person found and fixed an error, they changed the page, and thus their change needs to be verified, so the page is not done. as long as the next proofer finds an error, keep sending the page out; even if it's already done 27 rounds, you must keep sending it for more. (this is why you want your workflow not to allow meaningless changes.) when there's a round with no error found (i.e., the page is unchanged), you can figure (for simplicity) there's a 60/40 chance the page is perfect. 60/40 certainly isn't good enough odds, however, so do the page again. if the next person finds no error either, then odds are 84/16 it's perfect. (this assumes that this round, like the last one, gets _60%_ of the errors.) if 84/16 is good enough for you, fine. if not, send it through again, and -- if it comes out clean again -- the odds will then be 90/10 it's perfect. if that's not good enough either, do it again. clean again? now it's 96/4. your assumption throughout is that your proofers catch 60% of the errors. you can take my word that is a safe assumption. but you don't need to. 
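The compounding arithmetic in this message is 1-(1-p)^k: with the assumed 60% catch rate, the chance a page is perfect is 60% after one clean pass and 84% after two (the 90/10 and 96/4 figures quoted for the next passes are rougher approximations of the exact 93.6% and 97.4%). A sketch, with p = 0.6 taken as a working assumption rather than a measured figure:

```python
# Confidence after k consecutive "clean" (unchanged) proofing passes,
# assuming each pass would catch a remaining error with probability p.
# p = 0.6 is the message's working assumption, not a measured value.
def p_perfect(k, p=0.6):
    """Probability the page is error-free after k clean passes."""
    return 1 - (1 - p) ** k

for k in range(1, 5):
    print(k, round(p_perfect(k), 4))
```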
because over time, the _data_ tells you if your assumption is warranted... sometimes -- even after a "clean" judgment by one or more proofers -- the next proofer will find an error. oops! in fact, if a page has gotten two "clean" judgments, at 60% accuracy, odds are 84/16 it has an error. and -- at 60% accuracy -- the odds are that the next proofer will find it. so you just pay attention to the _results_ you actually obtain. if you find such pages -- 2 people said it was clean, but person #3 found an error -- happen 16% of the time, your proofers _do_ have an accuracy rate of 60%. if such pages happen _less_ than 16% of the time, their accuracy is higher. if such pages happen _more_ than 16% of the time, their accuracy is lower. with thousands of proofers doing thousands of pages, it won't take long (not long at all) to get a very good assessment of your proofer accuracy. and knowledge of that figure tells how many "clean" rounds are needed to get to _whatever_ level of accuracy you decide that you want to attain... in sum, you don't need a college statistics professor to solve this problem. you don't even need college-level statistics... you really don't... honestly... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/c0e1ddc6/attachment.htm From prosfilaes at gmail.com Tue Mar 18 15:30:42 2008 From: prosfilaes at gmail.com (David Starner) Date: Tue, 18 Mar 2008 18:30:42 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> References: <1253528362.2961205873693940.JavaMail.mail@webmail02> <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> Message-ID: <6d99d1fd0803181530ndebf17cg3a3809bb158dccef@mail.gmail.com> On Tue, Mar 18, 2008 at 5:06 PM, Brent Gueth wrote: > One thing I noticed being mentioned was correction even if the source > material included an error. If you're doing a study on classic sci-fi and > how it appeared printed in issue X of generic magazine - shouldn't any > errors included in the original printing be maintained verbatim? If you're doing a study on how sci-fi was printed in the original issues, there's nothing like having the original issues at hand. Our posts certainly won't cut it, after copyrighted material has been removed and the whole thing reformatted into HTML. There's no way to make an ebook that's perfect for everyone's goals. > If you're fixing possible spelling mistakes at first, what about the trend > later to fix grammar mistakes, That's a slippery slope argument. We can choose how liberal the changes we make in the text are. > What about when Twain writes in a dialect - granted we know what the words > should be, but at the same time any proofreader would see this as a > misspelling. Well, no, because the proofreaders have brains. Carrying errors around has costs too. Every mistake we keep can get people pointing it out to errata. Original mistakes can be as distracting to readers as new mistakes.
Nothing says we shouldn't carry information about corrections along, but I think for a text whose primary use will be as a reading text, we should make the obvious corrections. If you want the exact pedantic original, then look at the scans we should make available. From sly at victoria.tc.ca Tue Mar 18 18:25:40 2008 From: sly at victoria.tc.ca (Andrew Sly) Date: Tue, 18 Mar 2008 18:25:40 -0700 (PDT) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <1e8e65080803181506q11f4fd8eiad7d41d2e0efc29c@mail.gmail.com> References: <1253528362.2961205873693940.JavaMail.mail@webmail02> <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> <1e8e65080803181506q11f4fd8eiad7d41d2e0efc29c@mail.gmail.com> Message-ID: On Tue, 18 Mar 2008, Karen Lofstrom wrote: > I believe that this is one virtue of TEI, ne? It has a protocol for > noting original and emendation IN the flow of the text, so that > presumably you could write viewing software that would display both. > TEI is flexible, and has multiple courses you could take, depending on your desired outcome. For a more general approach, you could just add something like a DP transcriber's note, in the appropriate place in the TEI header. Otherwise, you could use the <sic> element, which leaves an error in place while suggesting a correction in an attribute. Or the <corr> element, which presents a corrected version and records the original in an attribute.
In http://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html See Section 11.3 Altered, Corrected, and Erroneous Texts Andrew From greg at durendal.org Tue Mar 18 18:47:07 2008 From: greg at durendal.org (Greg Weeks) Date: Tue, 18 Mar 2008 21:47:07 -0400 (EDT) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> References: <1253528362.2961205873693940.JavaMail.mail@webmail02> <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> Message-ID: On Tue, 18 Mar 2008, Brent Gueth wrote: > One thing I noticed being mentioned was correction even if the source > material included an error. If your doing a study on classic sci-fi and > how it appeared printed in issue X of generic magazine - shouldn't any > errors included in the original printing be maintained verbatim? That's what the page scans are for. Anyone that cares about that level of detail isn't going to trust the proofing job no matter what. If we provide page scans to go with the proofed text they have the best of both. -- Greg Weeks http://durendal.org:8080/greg/ From piggy at netronome.com Tue Mar 18 22:04:50 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Wed, 19 Mar 2008 01:04:50 -0400 Subject: [gutvol-d] let's make it simple, ok? In-Reply-To: References: Message-ID: <47E09EF2.3060605@netronome.com> Bowerbird at aol.com wrote: > let's make it simple, ok? > > the discussion is -- or should be -- how to make a roundless system work. > specifically, the question is "when should we consider a page to be > _done_?" > > the answer is really very simple. > > if the last person found and fixed an error, they changed the page, > and thus their change needs to be verified, so the page is not done. > > as long as the next proofer finds an error, keep sending the page out; > even if it's already done 27 rounds, you must keep sending it for more. 
> (this is why you want your workflow not to allow meaningless changes.) > > when there's a round with no error found (i.e., the page is unchanged), > you can figure (for simplicity) there's a 60/40 chance the page is > perfect. > > 60/40 certainly isn't good enough odds, however, so do the page again. > > if the next person finds no error either, then odds are 84/16 it's > perfect. > (this assumes that this round, like the last one, gets _60%_ of the > errors.) > > if 84/16 is good enough for you, fine. if not, send it through again, and > -- if it comes out clean again -- the odds will then be 90/10 it's > perfect. > > if that's not good enough either, do it again. clean again? now it's > 96/4. If you have a good way to get a solid consensus on what that probability should be, I would like to hear your suggestions. In a way, it's equivalent to my request for suggestions for a missed misprint cost function. One of my near-term goals is to provide a model which at least allows people to understand the time consequences of picking a specific threshold. Picking a simple threshold also neglects the relative importance of different kinds of errors. I think for most books, garbled words are a much more serious problem than period-comma confusion. It could be reasonable to say that we're happy with a 99% certainty on the removal of all garbled words and only a 50% certainty of the removal of all period-comma confusions. > > your assumption throughout is that your proofers catch 60% of the errors. > > you can take my word that is a safe assumption. but you don't need to. > > because over time, the _data_ tells you if your assumption is warranted... We already have a very large dataset with which to test the assumption. I would agree that 60%-75% is about right for the most common kinds of errors. But the rate is not constant. It falls steadily and very fast. 
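The odds quoted above follow from a simple geometric model: if each pass independently catches a remaining error with probability 0.6, the chance an error survives n consecutive clean passes is 0.4^n. A quick sketch of that arithmetic (the 60% catch rate is bowerbird's working assumption, not a measured constant, and as noted above the real rate is not constant across rounds):

```python
def chance_error_survives(catch_rate: float, clean_rounds: int) -> float:
    """Probability an error is still present after `clean_rounds`
    consecutive passes found nothing, assuming each pass independently
    catches a remaining error with probability `catch_rate`."""
    return (1.0 - catch_rate) ** clean_rounds

# With the assumed 60% catch rate:
for n in range(1, 5):
    miss = chance_error_survives(0.6, n)
    print(f"{n} clean round(s): {100 * (1 - miss):.0f}/{100 * miss:.0f}")
```

One clean round gives 60/40 and two give 84/16, matching the figures quoted; the exact model gives roughly 94/6 and 97/3 for three and four clean rounds, so the 90/10 and 96/4 figures above are looser roundings of the same idea.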
I can't say for certain yet, but I think the most difficult class of error has a discovery rate below 15% in a single pass. > > sometimes -- even after a "clean" judgment by one or more proofers -- > the next proofer will find an error. oops! in fact, if a page has gotten > two "clean" judgments, at 60% accuracy, odds are 84/16 it has an error. > and -- at 60% accuracy -- the odds are that the next proofer will find it. > > so you just pay attention to the _results_ you actually obtain. if > you find > such pages -- 2 people said it was clean, but person #3 found an error -- > happen 16% of the time, your proofers _do_ have an accuracy rate of 60%. > > if such pages happen _less_ than 16% of the time, their accuracy is > higher. > if such pages happen _more_ than 16% of the time, their accuracy is lower. > > with thousands of proofers doing thousands of pages, it won't take long > (not long at all) to get a very good assessment of your proofer accuracy. > and knowledge of that figure tells how many "clean" rounds are needed > to get to _whatever_ level of accuracy you decide that you want to > attain... You are neglecting error injection rate. Proofers don't just remove errors, they add them too. The error injection rate puts a lower bound on the accuracy we can achieve through serial proofing. If needed, parallel voting rounds can be used to compensate for the error injection rate. If there are defects which have detection rates down near the error injection floor, it may not be possible to remove them with any level of confidence at all. This is why I'm interested in a difference metric which can ignore "silly changes". It looks very likely that the noise floor (error injection rate) for "real errors" is substantially lower than the noise floor caused by "silly changes".
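The injection-rate point can be put in the same toy-model terms: if each serial pass removes a fraction d of the errors present but injects an average of i new ones, the expected error count converges to the floor i/d instead of zero. A sketch under invented parameters (the detection and injection rates here are illustrative, not DP measurements):

```python
def errors_after_rounds(start: float, detect: float, inject: float, rounds: int) -> float:
    """Expected errors per page after serial proofing rounds, where each
    round removes a fraction `detect` of the current errors and injects
    `inject` new errors on average. The fixed point is inject / detect."""
    errors = start
    for _ in range(rounds):
        errors = errors * (1.0 - detect) + inject
    return errors

# 10 errors/page, 60% detection, 0.05 errors injected per pass:
# after many rounds the count settles near 0.05 / 0.6, not zero.
print(errors_after_rounds(10.0, 0.6, 0.05, 20))
```

Parallel voting rounds attack this floor because independently injected errors rarely coincide across proofers.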
We can definitely make things a lot better without resorting to really high-power statistics. Gaa! Does anybody remember where I posted my detailed analysis of the "shoe plot"? Oh well, the point is that based on a simple graphical analysis I was able to make a strong recommendation that any project with more than 0.1 wa/w (roughly 1 change every 10 words) should repeat P1. I have to admit that the full generality of the problem fascinates me. I am trying hard to balance my interest in closed-form solutions with concrete suggestions which people can act on immediately. > > -bowerbird > From hart at pglaf.org Wed Mar 19 00:20:35 2008 From: hart at pglaf.org (Michael Hart) Date: Wed, 19 Mar 2008 00:20:35 -0700 (PDT) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> References: <1253528362.2961205873693940.JavaMail.mail@webmail02> <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> Message-ID: On Tue, 18 Mar 2008, Brent Gueth wrote: > You know on this thread I completely agree with a lot of the ideas > here and LOVE the idea of the color-based frequency word > proofreading system. > > That being said. > > One thing I noticed being mentioned was correction even if the > source material included an error. If you're doing a study on > classic sci-fi and how it appeared printed in issue X of generic > magazine - shouldn't any errors included in the original printing > be maintained verbatim? > > If you're fixing possible spelling mistakes at first, what about the > trend later to fix grammar mistakes, so essentially PG is going to > become the editor to fix things that the original may have honestly > put in there intentionally. I have heard that some publishers or > authors put in an occasional mistake on purpose to verify if > anyone else copied their work.
Granted this is more likely to > happen in a public domain anthology, but at the same time > something may come across as intentional. > > What about when Twain writes in a dialect - granted we know what > the words should be, but at the same time any proofreader would > see this as a misspelling. > > > I'm sure that the proofreaders are doing the best they can - but > in the end are we looking to end all errors - or are we looking to > make sure that the finished text is 100% accurate to the source > text? > Neither. If you want the latter, just use the raw scans or a Xerox. If you want the latter as full text eBooks, just do accurate OCR. If you want to correct obvious errors, that's just fine; most of our readers, including myself, would appreciate it. If you want the former, all I can say is "bon voyage." Thanks!!! Michael S. Hart Founder Project Gutenberg Recommended Books: Dandelion Wine, by Ray Bradbury: For The Right Brain Atlas Shrugged, by Ayn Rand: For The Left Brain [or both] Diamond Age, by Neal Stephenson: To Understand The Internet The Phantom Tollbooth, by Norton Juster: Lesson of Life. . . From marcello at perathoner.de Wed Mar 19 00:21:04 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 19 Mar 2008 08:21:04 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47E02B10.7090203@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> <47E01719.9000307@perathoner.de> <47E02B10.7090203@netronome.com> Message-ID: <47E0BEE0.1040303@perathoner.de> La Monte H.P. Yarroll wrote: > Marcello Perathoner wrote: >> La Monte H.P.
Yarroll wrote: >> >> >>> Much more important than finishing a certain number of rounds is to >>> actually predict the likely number of remaining errors in a specific >>> text (which we can do with moderate reliability) and then decide which >>> kind of round to subject it to. >>> >> Why would the "likely number of remaining errors" be a better estimator >> for which round to send the text to, than the number of errors found in >> the last round? >> > > Someone reading a text does not care how many errors were found in the > last round of proofreading. They care about the number remaining. > > Actually, the number of errors found in the last round appears to be a > pretty good predictor for the number of remaining errors, so the > distinction is not terribly critical. The relationship is not linear, > but it has a high correlation. That's what I was saying. If the two values highly correlate, why go to the extra trouble to calculate the second value? >> The set of errors in a text is recursively enumerable, meaning there is >> no way to know if you already found them all. >> > > But if we know the probability distributions of the errors, we can > estimate the likely number remaining, which is really what readers care > about. Thinko. The reader doesn't care about "the errors remaining". She cares about how many errors she "finds" while reading the text. Which probably is a lot less than what an experienced proofreader will find. > It's a little strong to say that we can't verify our predictions. We can > observe over a large number of experiments how closely the number of > errors we find matches what we expect given our model(s). You wanted to predict the "likely number of errors remaining". Which number you cannot verify. > Guinness does not need to taste every bottle of brew to have a high > confidence that they are keeping their quality standards. Why can a potato chip maker make a chip that costs $0.01 while a computer manufacturer's chip must cost $1000?
(BTW is there a searchable Dilbert database anywhere?) -- Marcello Perathoner webmaster at gutenberg.org From hart at pglaf.org Wed Mar 19 00:24:59 2008 From: hart at pglaf.org (Michael Hart) Date: Wed, 19 Mar 2008 00:24:59 -0700 (PDT) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47E02B10.7090203@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> <47E01719.9000307@perathoner.de> <47E02B10.7090203@netronome.com> Message-ID: On Tue, 18 Mar 2008, La Monte H.P. Yarroll wrote: > Marcello Perathoner wrote: >> La Monte H.P. Yarroll wrote: >> >> >>> Much more important than finishing a certain number of rounds is >>> to actually predict the likely number of remaining errors in a >>> specific text (which we can do with moderate reliability) and >>> then decide which kind of round to subject it to. >>> >> >> Why would the "likely number of remaining errors" be a better >> estimator for which round to send the text to, than the number of >> errors found in the last round? >> > > Someone reading a text does not care how many errors were found in > the last round of proofreading. They care about the number > remaining. False. Anyone seriously commenting on the possible correction of remaining errors will want to know how much effort it took to get there. Not to do so would be something like trying to plan the rest of a trip without knowing how many miles you had already travelled, in how much time, taking how much gas, etc., etc. Planning ahead is more than just pointing over the horizon. Thanks!!! Michael S. 
Hart Founder Project Gutenberg From Bowerbird at aol.com Wed Mar 19 01:53:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 04:53:42 EDT Subject: [gutvol-d] let's make it simple, ok? Message-ID: i said: > > if 84/16 is good enough for you, fine. > > if not, send it through again, and > > -- if it comes out clean again -- > > the odds will then be 90/10 it's perfect. piggy said: > If you have a good way to get > a solid consensus on > what that probability should be, > I would like to hear your suggestions. well, i've already told people -- repeatedly -- what i think the percentage should be... i'm willing to accept 1 error on every 10 pages -- as the starting point for release to the public so they can help proof the e-texts they "own" -- so that means i'm willing to accept a 90/10 rate. that, by definition, means a 10% chance of an error, which informs us there'll be 10 errors in 100 pages, thus yielding my 1-error-on-every-10-pages figure. lucky for us, though, i know we don't have to _settle_ for such a paltry figure. i know that, with good tools, we can boost our accuracy up to around the 99% level, which means 2 or 3 (2.5) errors on a 250-page book, or -- for a 150-pager like your test, just 1 or 2 (1.5) errors. and i can almost guarantee those errors are fairly trivial. i can say with certainty they won't be misspelled words, so -- except for the nightmare scenario of _missing_text_ -- the flaws will probably be related to _punctuation_errors_, and i haven't found a one of those yet that altered the plot, and it's not because i haven't been looking, because i have... (stealth scannos are spotted very quickly by real readers, so they are not nearly as frightening as they might seem to be.) but... still... all of this is rather meaningless... isn't it?... because you didn't ask me what _i_ think, did you? no, you asked how you could get a _consensus_ on _what_ the probability _should_ be. "should be".
well, heck, the best way to get that kind of consensus would be to run a poll and see what your people think. and then keep running the poll until everyone agrees... sorry, just kidding with that last part... :+) but seriously, run a poll... make people be specific and pick a number. you'll see that the difference between the "quality" and the "quantity" people exists mainly because nobody has bothered to quantify the argument... moreover, in the long run, everyone _will_ agree on it... when everyone sees that -- because of our good tools -- we can get a 99% accuracy level, _and_ obtain that quality with relatively little effort, people will be more than happy to put out the degree of effort needed to utilize those tools. quality will be known to be high, and quantity will begin to fly. really. even the most die-hard quality folks have to accept 2-3 errors in a book as acceptable. because if they don't, they're gonna be slitting their wrists any day now, so we won't have to worry about what they think for very long... but seriously, if you want the consensus opinion, run a poll. -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/44ea6fe5/attachment.htm From Bowerbird at aol.com Wed Mar 19 02:06:27 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 05:06:27 EDT Subject: [gutvol-d] let's make it simple, ok? Message-ID: piggy said: > You are neglecting error injection rate. i'm not "neglecting it". i'm _ignoring_ it. because i haven't wanted to have to tell you that you invented a really stupid concept there. > Proofers don't just remove errors, they add them too.
when you have a proofer who is _adding_ errors in -- and you will, so detect them as _beginners_ -- you need to take them aside and give them a lesson. instruct them exactly what the workflow expects of them -- which yes, means the policy needs to be unequivocal -- and then give them a pat on the back and send them back. they're glad you set them straight, and inject no more errors. simple. > If there are defects > which have detection rates > down near the error injection floor, > it may not be possible to remove them > with any level of confidence at all. you're just confusing yourself now. keep it simple... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/883b3c07/attachment.htm From schultzk at uni-trier.de Wed Mar 19 02:43:26 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Wed, 19 Mar 2008 10:43:26 +0100 Subject: [gutvol-d] let's make it simple, ok? In-Reply-To: References: Message-ID: <8C4FB3D0-BC45-41A5-8530-DAFF9C56D5C5@uni-trier.de> Hi Everybody, I have not been following this thread(s) closely, but I am wondering: If it might not be better to have proofing teams!!?? That is a group of proofers that work together. The advantages are the team members get to know the others' strengths and weaknesses. They catch each other's errors. They know that one is good at this, the other at that. One is likely to do that wrong. The team is then able to delegate tasks, clean up if necessary after each other. Thus reducing the error injection (though I believe that it is most likely negligible in most cases) and increasing confidence. A team can discuss things and find solutions by themselves. Also, the team develops its own hierarchy of proofers from proficient to inexperienced.
At first this may seem complicated, but it is actually quite simple. The size of such teams is a good question. The organisation can be left mostly to the anarchy of the net/groups. just my thoughts. Will be out for Easter, but I'll try to catch up next week. regards and happy easter eggs. keith. On 19.03.2008 at 10:06, Bowerbird at aol.com wrote: > piggy said: > > You are neglecting error injection rate. > > i'm not "neglecting it". i'm _ignoring_ it. > > because i haven't wanted to have to tell you > that you invented a really stupid concept there. > > > > Proofers don't just remove errors, they add them too. > > when you have a proofer who is _adding_ errors in > -- and you will, so detect them as _beginners_ -- > you need to take them aside and give them a lesson. > > instruct them exactly what the workflow expects of them > -- which yes, means the policy needs to be unequivocal -- > and then give them a pat on the back and send them back. > > they're glad you set them straight, and inject no more errors. > > simple. > > > > If there are defects > > which have detection rates > > down near the error injection floor, > > it may not be possible to remove them > > with any level of confidence at all. > > you're just confusing yourself now. keep it simple... > > -bowerbird > > > > ************** > Create a Home Theater Like the Pros. Watch the video on AOL Home. > (http://home.aol.com/diy/home-improvement-eric-stromer?video=15? > ncid=aolhom00030000000001) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/a35afc41/attachment-0001.htm From Bowerbird at aol.com Wed Mar 19 03:57:35 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 06:57:35 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 01 Message-ID: for a whole bunch of reasons that i will tell you later, i'm going to largely ignore the "parallel proofing" tests also announced on the "confidence in page" wiki-page. i know parallel proofing works. that fact is established. but since i've run through some of the current text anyway, and it's illustrative, let me quickly share that with you, ok? the list appended here shows the 376 lines that _differed_ on _parallel_ p1 proofings of "paul and the printing press". (say that several times really fast for a plosive experience.) most of these lists that you get from me _usually_ have 1) the old, "wrong" line listed on the top, and 2) the new, "corrected" line listed at the bottom. but _this_ time, _this_ list is different, because _either_ the _top_ line, or the _bottom_, or _both_, could be wrong. (but since they differ, we know that _one_ of them is wrong. i haven't put my "cheater" lines here, to show you _exactly_ where the lines differ, so you will have to figure that out for yourself, but i can assure you that they _do_ differ...) *** one thing remains the same, however, and that is that _many_ of the differences are due to exceedingly stupid d.p. policies... remember, there are 376 lines differing between the proofings. (for perspective, there are 6,575 non-blank lines in this book.) but 100+ of those differences are due to end-of-line hyphenates... (89 such differences occur on pages 138-147 alone, most likely due to a beginning proofer who didn't know he was supposed to rejoin.) 
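Since these are two parallel P1 proofings of the same pages, there is a standard way to estimate how many errors neither proofer caught: the Lincoln-Petersen capture-recapture estimate, which treats each proofer as an independent "capture" pass over the same error population. A sketch with invented counts (the 376 differing lines have not been resolved into these categories, so the numbers below are purely illustrative):

```python
def lincoln_petersen(found_a: int, found_b: int, found_both: int) -> float:
    """Capture-recapture estimate of the total number of errors, given
    how many errors proofers A and B each found and how many both found.
    N is estimated as (found_a * found_b) / found_both."""
    if found_both == 0:
        raise ValueError("need at least one error found by both proofers")
    return found_a * found_b / found_both

# Hypothetical counts: A found 300 errors, B found 280, 240 in common.
total = lincoln_petersen(300, 280, 240)
missed_by_both = total - (300 + 280 - 240)
print(f"estimated total: {total:.0f}, missed by both: {missed_by_both:.0f}")
```

This is exactly why the caught-by-both and caught-by-neither counts matter: the overlap between the two proofings is what lets you estimate the errors that are still invisible.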
and then there are also many differences on things that _could_ (and _should_) have been fixed _automatically_, by the machine, before the text ever went in front of volunteer human proofers... (i eliminated some of these -- the spacey contractions -- merely because i didn't want to be distracted by such meaningless stuff.) when we disregard all those cases, we'll be left with precious few honest-to-goodness differences between the parallel proofings. *** even now, i don't think there's much to say about the differences. in general, the incorrect versions look very much like the typical kind of errors that people get from o.c.r. -- case differences and incorrect letters and punctuation problems and other crap like dat. what happened here is that one proofer found and fixed the error, and the other proofer missed it. knowledge of the number of lines like these -- caught by one proofer _or_ the other -- is interesting. (and resolving those differences is what gives you great accuracy.) but you also need to know the number of errors caught by _both_, and the number of errors caught by _neither_, for the full picture... you'll have to wait for that data, though... -bowerbird > the list appended here shows the 376 lines that _differed_ > on _parallel_ p1 proofings of "paul and the printing press"... > http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea > http://www.pgdp.net/c/project.php?id=projectID45ca5d5645cfb p#001 p#002 p#003 p#004 p#005 p#006 p#007 001 01 Copyright, 1920, 001 01 Copyright, 1920 p#008 002 03 The arch-enchantcr's wand! -- Itself a nothing -- 002 03 The arch-enchanter's wand! -- Itself a nothing -- 003 05 To paralyze the Caesars -- and to strike 003 05 To paralyze the Caesars -- and to stike p#009 p#010 p#011 p#012 004 04 II THE CLASS MEETING AND WHAT FOLLOWED IT I3 004 04 II THE CLASS MEETING AND WHAT FOLLOWED IT 13 005 07 V PAUL GIVES THANKS FOR HIS BLESSINGS...50 005 07 V PAUL GIVES THANKS FOR HIS BLESSINGS... 
50 006 16 XIV PAUL MAKES A PILGRIMAGE TO THE CITY...162 006 16 XIV PAUL MAKES A PILGRIMAGE TO THE CITY... 162 p#013 p#014 007 06 More than one digniiied resident of the town 007 06 More than one dignified resident of the town p#015 p#016 008 05 IT was the vision of a monthly paper for the 008 05 It was the vision of a monthly paper for the p#017 p#018 009 09 expensive piece of property, my son," he replied. 009 09 expensive piece of property, my son," he relied. 010 18 "The better way to go at such an undertaking," 010 18 The better way to go at such an undertaking," p#019 p#020 p#021 p#022 011 01 "Say, Cart, what do you think of '2O starting 011 01 "Say, Cart, what do you think of '2O Starting 012 13 "Why, to print our life histories and obituaries 012 13 "Why to print our life histories and obituaries p#023 013 02 scheme? what about that?" 013 02 scheme? What about that?" 014 25 asserted at length. "But the ducats -- where 014 25 asserted at length. " But the ducats -- where 015 28 "I suppose we couldn't buy a press second-hand 015 28 "I suppose we couldn't buy a press secondhand p#024 016 11 some one to print the paper for us." 016 11 someone to print the paper for us." 017 20 them it was their money or their life -- death 017 20 tell them it was their money or their life -- death 018 28 Melville declared. "You ean't expect to boom 018 28 Melville declared. "You can't expect to boom p#025 019 03 the Fire-eater! Have a copy of the Jabberwock! 019 03 the Fire-eater! Have a copy of the Jabbermock! 020 06 get anywhere. why not call it The March 020 06 get anywhere. Why not call it The March 021 14 Hare it is! We'll begin getting subscriptions 021 14 Hare it is! We"ll begin getting subscriptions 022 24 money you and you can't get any one to print 022 24 money you find you can't get any one to print p#026 023 01 "The March Hare" he repeated wlth enthusnasm. 023 01 "The March Hare!" he repeated wlth enthusiasm. p#027 p#028 024 22 Kipper. 
We'll see what we can do toward 024 22 Kipper. we'll see what we can do toward p#029 025 04 Melville regarded his friend With undisguised 025 04 Melville regarded his friend with undisguised 026 08 March Hare! I can hear the shekels ehinking 026 08 March Hare! I can hear the shekels chinking 027 16 at it. What is one-fifty for such a team of wisdom 027 16 at it. What is one-fifty for such a ream of wisdom 028 17 as we're going to get for out money?" 028 17 as we're going to get for our money?" p#030 029 10 is not one of you who does not to make 029 10 is not one of you who does not want to make p#031 p#032 p#033 030 22 "Great Seott, Paul, but you have got a wily 030 22 "Great Scott, Paul, but you have got a wily 031 29 back a step or two. "I couldn't, Kip. Don't 031 29 back a step or two. " I couldn't, Kip. Don't p#034 032 13 any one else in a minute. But Father's so -- well 032 13 any one else in a minute. But Father's so -- well, p#035 033 04 and swept out of the oflice before your mouth 033 04 and swept out of the office before your mouth 034 21 Paul. At least I can make my try and convince 034 21 Paul. "At least I can make my try and convince 035 26 "I shan"t allow myself to expect much. Even 035 26 "I shan't allow myself to expect much. Even p#036 036 02 half of Melvil1e's opinion. 036 02 half of Melville's opinion. 037 06 reputation of being shrewd, close-fisted, and 037 06 of being shrewd, close-fisted, and 038 09 carryng a grudge to any length for the sheer 038 09 carrying a grudge to any length for the sheer 039 19 Birmingham's most widely circulated daily. 039 19 Burmingham's most widely circulated daily. 040 27 "So you're Paul Cameron. I've had dealings 040 27 So you're Paul Cameron. I've had dealings p#037 041 05 father,"suggested the great man, after he had 041 05 father," suggested the great man, after he had 042 10 Wouldn't like to print the March Hare, a new 042 10 wouldn't like to print the March Hare, a new 043 21 `March Hare." 043 21 March Hare." 
044 30 "why, indeed!" 044 30 "Why, indeed!" p#038 045 20 "I had two last night-myself and another 045 20 "I had two last night -- myself and another 046 25 Mr. Carter, the shadow of a Smile on 046 25 Mr. Carter, the shadow of a smile on p#039 p#040 p#041 047 14 "But my father-" burst out Paul, then 047 14 "But my father -- " burst out Paul, then 048 17 calmly. " We differ in politics and we've 048 17 calmly. "We differ in politics and we've 049 19 take my paper-wouldn't do it for love or 049 19 take my paper -- wouldn't do it for love or 050 26 "And with regard to the advertising I mentioned, 050 26 "And with regard to the advertising I mentioned," p#042 051 05 "As for Judge Damon-well, if you ean't 051 05 "As for Judge Damon -- well, if you can't 052 09 law and the best man I know to handle the Subject. 052 09 law and the best man I know to handle the subject. 053 14 staff," ventured Paul. ` 053 14 staff," ventured Paul. p#043 054 08 the Echo?" 054 08 the Echo?"' p#044 055 11 gggfg newspaper was such a difficult and expensive 055 11 newspaper was such a difficult and expensive p#045 056 30 "Oh, it's not that," said Paul quickly. "We 056 30 "Oh, it's not that," said Paul quickly. " We p#046 057 10 "People didn't always use to have paper, 057 10 People didn't always use to have paper, 058 11 my Son" 058 11 my son." 059 19 many kings, bishops, and persons of rank could 059 19 kings, bishops, and persons of rank could 060 30 in them however-material such as the Norse 060 30 in them however -- material such as the Norse p#047 061 01 Sagas and the Odes of Horace-were handed 061 01 Sagas and the Odes of Horace -- were handed p#048 062 12 loss," declared his father good-humored1y. 062 12 loss," declared his father good-humoredly. p#049 063 02 was first no great demand for them. Learning 063 02 was at first no great demand for them. 
Learning 064 20 the patient workers were so glad when their 064 20 the patient Workers were so glad when their 065 25 "'This book was illuminated, bound, and 065 25 "This book was illuminated, bound, and 066 30 "'Thanks be to God, Hallelujah!' 066 30 "Thanks be to God, Hallelujah!' p#050 067 23 a copy of this adjuratjion to what them host 067 23 a copy of this adjuration to what thou hast 068 25 "Thus, you see, was the copyist forced to 068 25 "Thus, you see, was the eopyist forced to 069 30 manuscripts, and many a one is marred by misspelling 069 30 manuscripts, and many a one is marred by mis-spelling p#051 070 17 and were sold to dignitaries of the Church or to 070 17 and were sold to of the Church or to p#052 071 11 The great objection to this method was that several 071 11 the great objection to this method was that several 072 15 entirely inappropriate to it." 072 15 was entirely inappropriate to it." 073 26 on with the project? You seem bothered." 073 26 with the project? You seem bothered." p#053 074 20 "Yes." 074 20 ??line missing here...?? 075 28 Paul waited an instant, then added dryly: "In 075 28 Paul waited an instant, then added dryly: " In p#054 076 29 that he knew could never be fullfilled and sent 076 29 that he knew could never be fulfilled and sent p#055 077 15 shoulder. "I'l1 do it! I declare if I won't. 077 15 shoulder. "I'll do it! I declare if I won't. 078 16 I ll send in my subscription to the Echo to-morrow. 078 16 I'll send in my subscription to the Echo to-morrow. 079 29 "Mr. Carter said Judge Damon was an expert 079 29 "Mr. Carter said Judge Damon was an ex- 080 30 on international law," explained Paul. 080 30 pert on international law," explained Paul. p#056 081 19 Again courage shone in Pau1's eyes. 081 19 Again courage shone in Paul's eyes. p#057 p#058 082 03 Mr. Cameron was as good as his word. 082 03 MR. CAMERON was as good as his word. 
083 24 moment, -- litt1e more, in fact, than a boy like 083 24 moment, -- little more, in fact, than a boy like p#059 084 17 Cameron. "Call them up this minute and nail 084 17 Cameron." Call them up this minute and nail 085 28 "O.K.!" he said. "I talked with one of the 085 28 "O. K.!" he said. "I talked with one of the p#060 p#061 086 17 Caesar did in Gaul, what Cyrus and the Silician 086 17 C?sar did in Gaul, what Cyrus and the Silician 087 22 and by and by the geometries, Roman his- 087 22 and by and by the geometries, Roman histories, 088 23 tories, and the peregrinations of Cyrus were 088 23 and the peregrinations of Cyrus were p#062 p#063 089 11 the judge mischievously. "If you boys propose 089 11 the judge mischievously. "It you boys propose p#064 p#065 090 18 to what methods you resorted to win these concessions 090 18 to what methods you resorted to win these con- 091 19 from these stern-purposed gentlemen. 091 19 cessions from these stern-purposed gentlemen. 092 23 "The judge, for example -- I can't imagine 092 23 "The judge, for example-I can't imagine p#066 p#067 093 20 New York and was, I fancy, glad to find someone 093 20 New York and was, I fancy, glad to find some 094 21 who was interested and would appreciate 094 21 one who was interested and would appreciate p#068 095 10 "Yes, and not only were the first manuscripts 095 10 "Yes, and not only were the first manuseripts 096 20 the common people `for whom they were not 096 20 the common people 'for whom they were not p#069 p#070 p#071 p#072 097 30 them would fill a room." 097 30 them would fill a room.:" p#073 p#074 p#075 098 30 at liberty to send contributions back with 098 30 ways at liberty to send contributions back with p#076 099 18 does n't like, regardless of who wrote it." 099 18 doesn't like, regardless of who wrote it." p#077 p#078 100 04 amid great excitement, excitement that 100 04 amid great excitement, -- excitement that p#079 101 05 was quite an eye opener! 
A paper for general 101 05 Was quite an eye opener! A paper for general 102 07 Burmingham. There was actually something 102 07 Burmingham. There Was actually something 103 24 else in the paper. Some thought more 103 24 else in the paper. Sorne thought more 104 27 others were for choking oft the girls' artieles on 104 27 others were for choking off the girls' articles on p#080 105 06 body of workers finally stood shoulder to shoulder, 105 06 body of workers hnally stood shoulder to shoulder, 106 28 with a pride in his especial r?le on the team, and 106 28 with a pride in his especial role on the team, and p#081 107 03 manager; the alumn?, now scattered in 107 03 manager; the alumnae, now scattered in 108 10 Into Paul's editorial sanctum articles from 108 10 Into Pau1's editorial sanctum articles from 109 28 Mrs. Wi1bur's garden. 1920 would see 109 28 Mrs. Wilbur's garden. 1920 would see p#082 110 11 one passed through the school corridors, and 110 11 one passed through the school corridors, and ` 111 31 the March Hate appeared, each marked by a 111 31 the March Hare appeared, each marked by a p#083 p#084 112 15 it. I've always envied those ehaps who whispered 112 15 it. I've always envied those chaps who whispered 113 26 impulse is a very selfish one," said his father. 113 26 impulse is a very seliish one," said his father. 
p#085 114 30 pioneer printers' initial eiiorts were turned in 114 30 pioneer printers' initial efforts were turned in p#086 115 20 be produced -- the first crude attempt at papermaking -- and 115 20 be produced -- the first crude attempt at paper-making -- and 116 28 ones were painted on tablets ot ivory, or engraved 116 28 ones were painted on tablets of ivory, or engraved p#087 117 09 altar cloths -- the brst primitive printing 117 09 altar cloths -- the first primitive printing p#088 p#089 118 03 and Diamonds for the more prosperous 118 03 and Diamonds for the more prosperous ` 119 27 stained-glass windows and mosaies in the 119 27 stained-glass windows and mosaics in the p#090 120 02 There were hieroglyphies in Egypt; 'speaking 120 02 There were hieroglyphics in Egypt; 'speaking 121 06 simple outline, by means of woodeuts, the religious 121 06 simple outline, by means of woodcuts, the religious 122 11 was one of the later and most skilful woodcut 122 11 was one of the later and most skilful Woodcut 123 13 woodcut was to art -- simple, direct, appealing." 123 13 woodcut was to art -- simple, direct, appealing" 124 17 public that desired to read -- which this one did 124 17 is public that desired to read -- which this one did p#091 125 10 a "cover contest", the prize oitered being a 125 10 a "cover contest", the prize offered being a 126 24 forward the f?te, more than one dignifted resident 126 24 forward the f?te, more than one dignified resident p#092 127 01 More than one dignified resident of the town struggled 127 01 More than one dignified resident of town struggled 128 02 into an incongruous garment. 128 02 into an incongruous garment. Page 74. p#093 p#094 129 02 the white Queen, the Red Queen, the Duchess, 129 02 the White Queen, the Red Queen, the Duchess, 130 03 Father william, and the Aged Man. Judge 130 03 Father William, and the Aged Man. 
Judge 131 06 the Carpenter, and Paul'ss mother, who was 131 06 the Carpenter, and Paul's mother, who was 132 12 the last moment as the Doormouse. 132 12 the last moment as the Dormouse. 133 17 democratie fashion. The frolic had in it a 133 17 democratic fashion. The frolic had in it a 134 21 in years!" ejaculated the postmaster. "Seems 134 21 in years!" ejaculated the postmaster. " Seems 135 22 it like we've all got better acquainted with our 135 22 like we've all got better acquainted with our p#095 136 04 their diiterenees by talking together about their 136 04 their differences by talking together about their 137 30 one evening, "that the printing press was in 137 30 one evening, "that the printing press was invented 138 31 vented by Lawrence Coster (or Lorenz Koster) 138 31 by Lawrence Coster (or Lorenz Koster) p#096 139 20 John a native of Strasburg, who 139 20 John Gutenburg,a native of Strasburg, who p#097 p#098 140 14 he had done the inventor had it all to creat 140 14 he had done the inventor had it all to create 141 19 "How soon did he resmake his metal 141 19 "How soon did he re-make his metal p#099 p#100 142 02 dispute the Archbishop'S Bible was produced 142 02 dispute the Archbishop's Bible was produced 143 11 precisely like the king;s and the Archbishop's. 143 11 precisely like the king's and the Archbishop's. 144 27 "I suppose he went told!" put in Paul 144 27 "I suppose he went and told!" put in Paul p#101 145 22 meantime william Caxton, an English mer 145 22 meantime William Caxton, an English merchant, 146 23 chant, traveled to Holland to buy cloth, and 146 23 traveled to Holland to buy cloth, and 147 28 from Iwestminster Abbey. The first English 147 28 from Westminster Abbey. The first English p#102 148 17 only because of an established precedent, out 148 17 only because of an established precedent, but p#103 149 10 Mandevi1le's Travels, Sidney's 'Arcadia', 149 10 Mandeville's Travels, Sidney's 'Arcadia', p#104 150 03 type. 
But Gutenburg was the tirst to combine 150 03 type. But Gutenburg was the first to combine 151 05 purposes. In other words, he was the Brst 151 05 purposes. In other words, he was the first p#105 152 11 a volume in itself. Many Scholars and many 152 11 a volume in itself. Many scholars and many p#106 p#107 p#108 p#109 153 20 "But there are short cuts," argued Mr. Cameron. 153 20 "But there are short outs," argued Mr. Cameron. p#110 154 17 his father answered. "Nothinig walks with 154 17 his father answered. "Nothing walks with 155 28 generous eitizens, have opened their doors to 155 28 generous citizens, have opened their doors to p#111 156 25 or enamel. As time went on and the religious 156 25 or enamel. As time Went on and the religious p#112 p#113 157 18 print nfty copies of a volume as several hundred. 157 18 print fifty copies of a volume as several hundred. p#114 158 02 at all. They get a scenario or r?sum? of the 158 02 at all. They get a scenario or resume of the p#115 159 31 it are rectifed. After this it is again corrected 159 31 it are rectified. After this it is again corrected p#116 160 25 technicality as the filling out of a short-line." 160 25 technicality as the filling out of a short line." p#117 161 19 cultured nation. By no means. What I mean 161 19 cultured nation. By no means. what I mean 162 20 is that our public school systerh offers education 162 20 is that our public school system offers education 163 30 citizens can read and write, and vast is 163 30 citizens can read and write, and vast p#118 p#119 164 15 are always seamps in every calling, the best 164 15 are always scamps in every calling, the best p#120 165 13 "Typewriters come at all prices," his father 165 13 "Typewriters Come at all prices," his father p#121 p#122 p#123 p#124 p#125 p#126 p#127 p#128 p#129 p#130 166 07 When the accounts were found to be short, 166 07 When the acounts were found to be short, p#131 167 20 bills as it went along; then its editors. 
would 167 20 bills as it went along; then its editors would 168 30 What was to be done? 168 30 what was to be done? p#132 169 18 He broke oft speechlessly. 169 18 He broke off speechlessly. 170 26 I can't understand it. We haven't branched 170 26 I can't understand it. we haven't branched p#133 171 08 for a farm down East. And how the fresh-men 171 08 for a farm down East. And how the freshmen 172 14 "I, for one, say we don't tell anybody," Mel- 172 14 "I, for one, say we don't tell anybody," Melville 173 15 ville burst out. "I've some pride and I draw 173 15 burst out. "I've some pride and I draw 174 28 "We? 174 28 "We?" p#134 175 07 "Yep" 175 07 "Yep." 176 09 "Could you manage it -- fifty dollars?" 176 09 "Could you manage it-fifty dollars?" 177 17 "I don't care about being joshed, either," dedared 177 17 "I don't care about being joshed, either," declared 178 19 "Something's fussing you. What is it?" 178 19 "Something's fussing you. what is it?" p#135 179 12 Bond" was converted into cash; Paul's typewriter 179 12 Bond" was converted into cash; Paul'S typewriter p#136 p#137 180 01 well. In fact, it was not long before these de- 180 01 well. In fact, it was not long before these departments 181 02 partments were merged into a sort of forum 181 02 were merged into a sort of forum 182 07 Arthur Presby Carter sat quietly in his oiiiee 182 07 Arthur Presby Carter sat quietly in his office 183 17 confess that a seventeenyear-old boy had 183 17 confess that a seventeen-year-old boy had p#138 184 09 publication had been born that was undermin- 184 09 publication had been born that was undermining 185 10 ing his prestige and putting to naught his creeds 185 10 his prestige and putting to naught his creeds 186 21 was a shrewd business man. He had, he con- 186 21 was a shrewd business man. 
He had, he confessed 187 22 fessed to himself, been trapped into printing 187 22 to himself, been trapped into printing p#139 188 02 which he had never suspected the existence, -- 188 02 which he had never suspected the existence, -- an 189 03 an intelligence, an open-mindedness, a search- 189 03 intelligence, an open-mindedness, a searching 190 04 ing after truth. Hitherto the subscribers to 190 04 after truth. Hitherto the subscribers to 191 11 through every page -- that beating of hearts -- 191 11 through every page -- that beating of hearts -- fathers, 192 12 fathers, mothers, girls, boys speaking with 192 12 mothers, girls, boys speaking with 193 16 blood that glowed so warmly and sympatheti- 193 16 blood that glowed so warmly and sympathetically 194 17 cally through the dead mediums of paper and 194 17 through the dead mediums of paper and 195 27 characteristic honesty that had he cared to ob- 195 27 characteristic honesty that had he cared to obtain 196 28 tain from them this free expression of opinion 196 28 from them this free expression of opinion 197 29 and learn the reactions their minds were con- 197 29 and learn the reactions their minds were constantly 198 30 stantly reflecting, he would have been at a loss 198 30 reflecting, he would have been at a loss p#140 199 02 mere boy, a boy the age of his own son, the elu- 199 02 mere boy, a boy the age of his own son, the elusive 200 03 sive result had been accomplished! 200 03 result had been accomplished! 201 13 It was this " echoing idea" that was new to 201 13 It was this "echoing idea" that was new to 202 21 appeal, the elder man faced the real psychologi- 202 21 appeal, the elder man faced the real psychological 203 22 cal secret of the junior paper's success: it lis- 203 22 secret of the junior paper's success: it listened 204 23 tened and did not talk; it was a dialogue instead 204 23 and did not talk; it was a dialogue instead 205 24 of a monologue,-an exact reversal of his policy. 
205 24 of a monologue, -- an exact reversal of his policy. 206 25 Moreover, this dialogue, contrary to his pre- 206 25 Moreover, this dialogue, contrary to his previous 207 26 vious beliefs, presented amazingly interesting 207 26 beliefs, presented amazingly interesting 208 29 America, -- what its government, its statesman- 208 29 America, -- what its government, its statesmanship, 209 30 ship, its ideals should be. The Past was rich 209 30 its ideals should be. The Past was rich p#141 210 02 faith, courage. Youth, the citizen of to-mor- 210 02 faith, courage. Youth, the citizen of to-morrow, 211 03 row, had a thousand theories for righting the 211 03 had a thousand theories for righting the 212 12 stimulate but to silence discussion and it prob- 212 12 stimulate but to silence discussion and it probably 213 13 ably did so, descending upon its audience with a 213 13 did so, descending upon its audience with a 214 18 not to lift up his voice in its presence and de- 214 18 not to lift up his voice in its presence and demand 215 19 mand a hearing. 215 19 a hearing. 216 20 Such a novel and rare product was worth per- 216 20 Such a novel and rare product was worth perpetuating. 217 21 petuating. From a money standpoint alone the 217 21 From a money standpoint alone the 218 22 paper might become in time a paying invest- 218 22 paper might become in time a paying investment. 219 23 ment. It was, of course, a bit crude at present; 219 23 It was, of course, a bit crude at present; 220 28 enterprise at the end of the year and take it in- 220 28 enterprise at the end of the year and take it into 221 29 to his own hands? Might it not be nursed into 221 29 his own hands? 
Might it not be nursed into p#142 222 01 He would improve it -- that would go with- 222 01 He would improve it -- that would go without 223 02 out saying -- touch it up and polish it; doubt- 223 02 saying -- touch it up and polish it; doubtless 224 03 less he would think best to revise some of its 224 03 he would think best to revise some of its 225 06 could not continue to perpetuate such an ab- 225 06 could not continue to perpetuate such an absurdity 226 07 surdity as that title. Perhaps he would christen 226 07 as that title. Perhaps he would christen 227 09 The notion of purchasing the amateur prod- 227 09 The notion of purchasing the amateur product 228 10 uct appealed to his sense of humor. The more 228 10 appealed to his sense of humor. The more 229 14 Yes, he would get out the few remaining is- 229 14 Yes, he would get out the few remaining issues 230 15 sues of the March Hare under its present name 230 15 of the March Hare under its present name 231 25 himself in the solitude and silence of his edi- 231 25 himself in the solitude and silence of his editorial 232 26 torial sanctum. And after he had disposed of 232 26 sanctum. And after he had disposed of 233 29 deliberation to purchase also certain oil prop- 233 29 deliberation to purchase also certain oil properties 234 30 erties in Pennsylvania. For Mr. Arthur 234 30 in Pennsylvania. For Mr. Arthur p#143 235 03 and buying March Hare or oil wells was all 235 03 and buying March Hares or oil wells was all p#144 236 04 and thus reflected on his many business ven- 236 04 and thus reflected on his many business ventures 237 05 tures Paul Cameron was also sitting in his ed- 237 05 Paul Cameron was also sitting in his editorial 238 06 itorial domain thinking intently. 238 06 domain thinking intently. 239 08 treasury bothered him more than he was will- 239 08 treasury bothered him more than he was willing 240 09 ing to admit. It was, of course, quite possible 240 09 to admit. 
It was, of course, quite possible 241 10 for him to repair the error -- for he was con- 241 10 for him to repair the error -- for he was convinced 242 11 vinced an error in the March Hare's bookkeep- 242 11 an error in the March Hare's bookkeeping 243 12 ing had caused the shortage. A bill of a hun- 243 12 had caused the shortage. A bill of a hundred 244 13 dred dollars must have been paid and not re- 244 13 dollars must have been paid and not recorded. 245 14 corded. Melville Carter had never had actual 245 14 Melville Carter had never had actual 246 22 was no easy task. It was a thankless job, any- 246 22 was no easy task. It was a thankless job, anywy 247 23 way -- the least interesting of any of the posi- 247 23 -- the least interesting of any of the positions 248 24 tions on the paper, and one that entailed more 248 24 on the paper, and one that entailed more p#145 249 11 mistake of one figure in adding and subtract- 249 11 mistake of one figure in adding and subtracting 250 12 ing columns. There did not, it was true, seem 250 12 columns. There did not, it was true, seem 251 30 were that a boy of seventeen was unable to an- 251 30 were that a boy of seventeen was unable to answer! 252 31 swer! If he were to ask his father how to sell 252 31 If he were to ask his father how to sell p#146 253 01 the bond, it might arouse suspicion, to ask any- 253 01 the bond, it might arouse suspicion, to ask anybody 254 02 body else might do so too. People would won- 254 02 else might do so too. People would wonder 255 03 der why he, Paul Cameron, was selling a Lib- 255 03 why he, Paul Cameron, was selling a Liberty 256 04 erty Bond he had bought only a short time be- 256 04 Bond he had bought only a short time before. 257 05 fore. Burmingham was a gossipy little town. 257 05 Burmingham was a gossipy little town. 258 24 thought he realized that Mr. Stacy was an in- 258 24 thought he realized that Mr. 
Stacy was an intimate 259 25 timate friend of his father's and might mention 259 25 friend of his father's and might mention 260 26 the incident. Therefore he at length dis- 260 26 the incident. Therefore he at length dismissed 261 27 missed the possibility of selling his bond and 261 27 the possibility of selling his bond and p#147 262 01 Echo offices that day with copy for the next is- 262 01 Echo offices that day with copy for the next issue 263 02 sue of his paper and was still rebelliously wa- 263 02 of his paper and was still rebelliously wavering 264 03 vering over the loss of his typewriter when the 264 03 over the loss of his typewriter when the 265 13 with Mr. Carter, toward whom he still main- 265 13 with Mr. Carter, toward whom he still maintained 266 14 tained no small degree of awe; usually the af- 266 14 no small degree of awe; usually the affairs 267 15 fairs relative to the school paper were trans- 267 15 relative to the school paper were transacted 268 16 acted either through the business manager of 268 16 either through the business manager of 269 18 But to-day Mr. Carter was suddenly all ami- 269 18 But to-day Mr. Carter was suddenly all amiability. 270 19 ability. He escorted Paul into his sanctum, 270 19 He escorted Paul into his sanctum, 271 23 "How is your paper coming on, Paul?' he 271 23 "How is your paper coming on, Paul?," he 272 27 "Austin, our manager, tells me your circu- 272 27 "Austin, our manager, tells me your circulation 273 28 lation is increasing." 273 28 is increasing." p#148 p#149 p#150 274 12 "B -- u -- t -- " stammered Paul and then 274 12 "B -- u -- t-" stammered Paul and then p#151 p#152 p#153 275 03 "I -- I -- " faltered Paul. 275 03 "I -- I-" faltered Paul. 276 30 "I don't quite -- " 276 30 "I don't quite-" p#154 p#155 277 09 it." 277 09 it." t 278 14 "y -- e -- s." 278 14 "Y -- e -- s." 279 18 "Oh-ho! So you're in a scrape, eh?" 279 18 "Oh -- ho! So you're in a scrape, eh?" p#156 280 02 Paul. Page 137. 280 02 Paul. 
Page 13T. p#157 p#158 281 09 prefer. A loan with a bond for security is 281 09 prefer, A loan with a bond for security is 282 12 "But -- " 282 12 :But -- " p#159 p#160 283 20 Paul fingered the bill nervously. Fifty dollars! 283 20 Paul lingered the bill nervously. Fifty dollars! p#161 p#162 p#163 p#164 284 02 money and government notes are fine examples 284 02 money and government notes are line examples p#165 285 23 of the authorities but it does a 285 23 of the Washington authorities but it does a 286 31 quantities of paper," answered his father. 286 31 quantities of paper," answered his father; p#166 287 01 "Directories, telephone books, cireulars, and 287 01 "Directories, telephone books, circulars, and 288 17 t them in color; dry goods houses send out photographs 288 17 them in color; dry goods houses send out photographs 289 21 there are commercial nrms whose mail-order 289 21 there are commercial firms whose mail-order 290 28 little expense this means of advertising is be- 290 28 little expense this means of advertising is becoming 291 29 coming more and more popular. Many charities 291 29 more and more popular. Many charities p#167 292 12 do little else," smiled his father. " Nevertheless, 292 12 do little else," smiled his father. "Nevertheless, p#168 p#169 293 17 Mr. Cameron waited a second. A Wild impulse 293 17 Mr. Cameron waited a second. 
A wild impulse p#170 p#171 p#172 294 14 gig had won the election, it is true, but it had been 294 14 had won the election, it is true, but it had been p#173 295 13 school, and all the web of circumstances in 295 13 school, and all the Web of circumstances in p#174 p#175 p#176 296 25 press rooms for striking off proof when the 296 25 press rooms for striking oil proof when the p#177 297 27 a press was built up which is so intricate and 297 27 a press Was built up Which is so intricate and p#178 p#179 298 12 visit to a big newspaper office Saturday evening 298 12 visit to a big newspaper offfice Saturday evening 299 15 "That Would be great!" 299 15 "That would be great!" p#180 p#181 300 06 you must remember that it was especially difficult 300 06 you must remember that it was especially diffcult p#182 301 05 "So, son," concluded Mr. Wright, "you've 301 05 "So, son," concluded Mr. wright, "you've 302 15 approve of the fifty-dollar bill which at that 302 15 approve of the fity-dollar bill which at that p#183 p#184 303 07 about 303 07 about. p#185 p#186 304 01 their days. " I'm going to take you upstairs 304 01 their days." 305 02 first," Mr. Hawley said briskly. "We may 305 02 Mr. Hawley said briskly. "We may 306 08 frankly. ` 306 08 frankly. p#187 307 21 This cast is then fitted upon the rollers 307 21 This east is then fitted upon the rollers p#188 308 11 have the main idea and when I see the thing in 308 11 have the main idea and When I see the thing in p#189 309 04 surface." 309 04 surface.' 310 17 The style or design of letter is called the `face', 310 17 The style or design of letter is called the 'face', 311 30 find what they want. I should think --" 311 30 find what they want. 
I should think -- " p#190 312 19 a small space allowed it; X, too, is not much in 312 19 a small space allowed it; N, too, is not much in p#191 p#192 313 19 large metal sections that fit on the two halves of 313 19 large metal sections that lit on the two halves of p#193 314 13 type constantly becorne very expert in detecting 314 13 type constantly become very expert in detecting 315 30 process and know how the first printing 315 30 process and know how the brst printing p#194 p#195 316 16 gteat amount of time and thought that goes 316 16 great amount of time and thought that goes p#196 317 18 of each shelf classified and marked." 317 18 of each shelf classined and marked." p#197 318 27 and see some of ours at nrst hand." 318 27 and see some of ours at first hand." p#198 319 11 however, the Boston Post ventured an innovation 319 11 however, the Boston Post ventured an innovation by 320 12 by arranging its presses one over the other, 320 12 arranging its presses one over the other, 321 17 " If floor space can be economized it must be 321 17 "If floor space can be economized it must be p#199 p#200 322 01 They had now reached the lowest floor and 322 01 They had now reached the lowest Hoor and 323 06 a high above his head. 323 06 high above his head. 324 31 duty it was to load it on to a truck, carry it up- 324 31 it duty it was to load it on to a truck, carry it up- p#201 325 15 periodicals," Mr. Hawley managed to shout 325 15 periodicals, "Mr. Hawley managed to shout 326 25 during the war," stammered Paul. 326 25 during the war," Stammered Paul. p#202 327 19 publishers." 327 19 publishers." I p#203 328 02 the cardboard. The thickness of these semicylindrical 328 02 the cardboard. The thickness of these semi-cylindrical 329 09 cast, the half sections of stereotype were put 329 09 cast, the sections of stereotype were put p#204 330 01 little chap over there by the fire hangs our 330 01 little chap over there by the bre hangs our 331 16 Sidewalk. 331 16 sidewalk. 
332 27 we ought to pay more for our newspapers." 332 27 we ought to pay more for our newspapers.' p#205 333 14 fine articles from parents and distant 333 14 fine articles from patents and distant p#206 334 02 bid good-by to the familiar halls of the school, 334 02 bid good-by to the familiar balls of the school, 335 21 clouded Pau1's brow. He still had intact Mr. 335 21 clouded Paul's brow. He still had intact Mr. p#207 336 05 own, was far from being the same thing as returning 336 05 own, was far from being the same thing as returning it. 337 06 it. It was strange that it should be so 337 06 It was strange that it should be so 338 20 easily to be cleared from Pau1's path. 338 20 easily to be cleared from Paul's path. 339 28 his classmates to earn it, -- for earn it he must, 339 28 his classmates to earn it, -- -for earn it he must, p#208 p#209 340 28 "Because -- well -- it would be so yellow," 340 28 "Because -- well-it Would be so yellow," 341 30 thing is yours -- why -- ," he broke off help 341 30 thing is yours -- why -- ," he broke off help- p#210 342 26 he wanted to sell them. Father said so. Besides, 342 26 he wanted to sell them. Father said so. Be 343 27 what's to become of 1921 if you sell out 343 27 sides, what's to become of 1921 if you sell out p#211 p#212 344 05 "But -- to sell it out for cash, as it stands -- you 344 05 "But -- to sell it out for cash, as it stands -- 345 06 mean that?" 345 06 you mean that?" 346 09 "Yes" 346 09 "Yes." p#213 347 22 he heard himself saying, "I'd call it a beastly 347 22 he heard himself saying, " I'd call it a beastly p#214 348 06 Hare to 1921 with out blessing?" asked Paul, 348 06 Hare to 1921 with our blessing?" asked Paul, 349 22 "Nothing! Cut it out, that's all." 349 22 "Nothing! 'Cut it out, that's all." p#215 350 24 "Yes, I'm corning right now," returned Paul 350 24 "Yes, I'm coming right now," returned Paul 351 31 with the boy? 351 31 with the boy?' p#216 352 20 fancy the corning interview with Mr. Carter. 
352 20 fancy the coming interview with Mr. Carter. p#217 353 06 only too fast. 4 353 06 only too fast. 354 16 come, something within him had leaped into being, -- something 354 16 come, something within him had leaped into being, 355 17 that had automatically prevented 355 17 -- something that had automatically prevented p#218 356 28 other side of it and all retreat would be out off. 356 28 other side of it and all retreat would be cut off. 357 30 only that he dreaded...The knob turned 357 30 only that he dreaded... The knob turned p#219 358 17 sharply. " I'm sorry to hear that. What was 358 17 sharply. "I'm sorry to hear that. What was p#220 p#221 p#222 359 15 Mr. Carter -- "you were just right, son. The 359 15 Mr. Carter -- " you were just right, son. The p#223 360 03 "Why, sir, I can't-" 360 03 "Why, sir, I can't -- " p#224 361 11 "How are you, old man," Paul called jubilantly. 361 11 "How are you, old man,' Paul called jubilantly. 362 21 "He was great-corking!" 362 21 "He was great -- corking!" p#225 363 07 Donald broke into a laugh. t 363 07 Donald broke into a laugh. p#226 364 30 loyally refusing to peach on his chums. That 364 30 loyally refusing to peach on his churns. That p#227 365 24 "They say there always has to be a first time. 365 24 "They say there always has to be a fist time. p#228 p#229 366 27 wretchedly. "That's what's got me fussed. 366 27 wretchedly. "That's what'S got me fussed. p#230 p#231 367 18 Paul. "But it's all right now. The 367 18 Paul. " But it's all right now. The 368 19 accounts are O. K.; I shall get my money back; 368 19 accounts are O.K.; I shall get my money back; 369 27 him," cried Donald. " He's a trump! As for 369 27 him," cried Donald. "He's a trump! As for p#232 370 15 that money. It's caused too much worry already." 370 15 that money. It's caused too much Worry already." p#233 371 06 delivered was clicked off on Mr. Carter's typewriter 371 06 delivered was clicked offon Mr. 
Carter's typewriter p#234 372 17 "And I on yours, Mr. Carter. Melville is a 372 17 "And I oh yours, Mr. Carter. Melville is a 373 30 school end the community a service, Carter, by 373 30 school and the community a service, Carter, by p#235 p#236 374 25 Cameron." 374 25 Cameron.' p#237 p#238 375 03 With a sigh glad yet regretful, Paul Surrendered 375 03 With a sigh glad yet regretful, Paul surrendered 376 12 familiar classroorns. And the comrades of 376 12 familiar classrooms. And the comrades of p#239 p#240 p#241 p#242 p#243 p#244 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/517cbf0a/attachment-0001.htm From julio.reis at tintazul.com.pt Wed Mar 19 04:08:38 2008 From: julio.reis at tintazul.com.pt (=?ISO-8859-1?Q?J=FAlio?= Reis) Date: Wed, 19 Mar 2008 11:08:38 +0000 Subject: [gutvol-d] gutvol-d Digest, Vol 44, Issue 24 In-Reply-To: References: Message-ID: <1205924918.27032.49.camel@abetarda.mshome.net> > just to remind them, off the dome, this is what needs to be done: > 1. ensure you have decent scans, and name them intelligently. > 2. use a decent o.c.r. program, and ensure quality results. > 3. do not tolerate bad text handling by content providers. > 4. do a decent post-o.c.r. cleanup, before _any_ proofing. > 5. retain linebreaks (don't rejoin hyphenates or clothe em-dashes). > 6. change the ridiculous ellipse policy to something sensible. > 7. stop doing small-cap markup with no semantic meaning. > 8. i forget what 8 was for. > 9. retain pagenumber information, in an unobtrusive manner. > 10. format the ascii version using light markup, for auto-html. > > -bowerbird Hey, I like bowerbird's item number 8 :) Lighten up people, just a joke. Please go easy on the ranting.
DP or not DP, we all want the best for the public domain. Right? Let's shake hands or kiss, now. bowerbird's issues number 1, 2, 3, and 4 also get a "certainly" from me. 6 is language-dependent so I'm staying out of that issue. English is not my language; I ask my kitty to type all my English ^___^ I'm not commenting on the other items, because I'm trying to be positive here; consensus-building and the like. But these first four items -- I'm with you all the way. Actually (I can't resist another joke), bowerbird's number 8 was "type everything in lowercase" lol. hey, never mind man, i like your writing style. i can always tell when it's your post even without looking at the from field. Júlio. From piggy at netronome.com Wed Mar 19 05:03:50 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Wed, 19 Mar 2008 08:03:50 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> <47E01719.9000307@perathoner.de> <47E02B10.7090203@netronome.com> Message-ID: <47E10126.8010300@netronome.com> Michael Hart wrote: > On Tue, 18 Mar 2008, La Monte H.P. Yarroll wrote: > > >> Marcello Perathoner wrote: >> >>> La Monte H.P. Yarroll wrote: >>> >>> >>> >>>> Much more important than finishing a certain number of rounds is >>>> to actually predict the likely number of remaining errors in a >>>> specific text (which we can do with moderate reliability) and >>>> then decide which kind of round to subject it to. >>>> >>>> >>> Why would the "likely number of remaining errors" be a better >>> estimator for which round to send the text to, than the number of >>> errors found in the last round?
>>> >>> >> Someone reading a text does not care how many errors were found in >> the last round of proofreading. They care about the number >> remaining. >> > > > False. > > Anyone seriously commenting on the possible correction of remaining > errors will want to know how much effort it took to get there. > > Not to do so would be something like trying to plan the rest of a > trip without knowing how many miles you had already travelled, in > how much time, taking how much gas, etc., etc. > > Planning ahead is more than just pointing over the horizon. > Very good point. I was only thinking about the casual reader. My wife suggests that we could include accuracy-related metrics in a proofing note for each book. It's not necessarily useful for everyone, but it would be nice to not lose the information. > Thanks!!! > > Michael S. Hart > Founder > Project Gutenberg > From Bowerbird at aol.com Wed Mar 19 09:03:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 12:03:42 EDT Subject: [gutvol-d] so, what have we learned from "perpetual p1"? Message-ID: wrote this yesterday. still appropriate today. *** now that the results are in for "perpetual p1", let's ask ourselves some questions about the "confidence in page" work which generated it. *** so, what did we learn from "perpetual p1"? not much, really. p1 will remove a large mass of the errors, and subsequent rounds will whittle away at the rest, until finally there are only very few remaining... but, um, well... who didn't know this already? proofing is an easy job. any person who is motivated to do it well _can_ do it fairly well, as long as they understand a book's content. (can't proof greek too well if you don't know it, or equations if you haven't taken math classes. but under most circumstances, proofing is easy.) *** will proofers cycle corrections back-and-forth? if your workflow allows them to do so, maybe, especially if you don't train them adequately... 
but the best course will be to fix the workflow. *** and how is the workflow over at d.p.? it sucks. badly. it imposes meaningless work on the proofers, and doesn't facilitate the work they need to do, so the efficiency of the operation is very weak... (and it's _not_ ok to waste people's time just because they've voluntarily given it.) *** so, did we accomplish our mission? um, nope... not as far as i can see... the original charter was to determine a "confidence in page" measure that would tell if a page needed to be proofed again, to use in implementing a roundless system. but somewhere, the mission was abandoned. now there's just a big bunch of gobbledygook on the "confidence in page" wiki-page at d.p.: > http://www.pgdp.net/wiki/Confidence_in_Page_analysis ironically, the _best_ logic answering the question is stuff that was put on the wiki early. by piggy: > If it covers ALL pages, then we can conclude that > each round finds about half as many pages > with errors as the previous round. This is the sort of > stable epsilon process I've been expecting. I THINK > this translates into "Each round finds about > half the remaining defects." > Each round of a page having zero changes merely > increases our confidence that it is defect-free > (by a factor of about 2). note that if you're cutting the number of errors in half, that your accuracy rate is 67%, which is just about what i've been figuring all along as the "average" accuracy... *** so, was "perpetual p1" worth the time spent on it? no. still, i spent the time anyway, to document it... but juliet wants to believe that proofing is _difficult_ -- the message she spreads all around cyberspace -- so she's going to ignore the conclusion that it's not... that p1 proofers can do well doesn't fit her viewpoint. and there's such a huge investment in the "p2" and "p3" hierarchy over at d.p. 
they probably cannot dismantle it, not without making themselves look really really stupid, so they ain't about to do that any time in the near future. this whole "confidence in page" wild goose chase is just the equivalent of the "busywork" required of p1 proofers -- the wizard of oz sending dorothy and her companions off to collect the broom of the wicked witch of the west, as a way to make them _go_away_ and waste their time... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/9568aae8/attachment.htm From marcello at perathoner.de Wed Mar 19 11:36:35 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 19 Mar 2008 19:36:35 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47E02B10.7090203@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> <47E01719.9000307@perathoner.de> <47E02B10.7090203@netronome.com> Message-ID: <47E15D33.7080901@perathoner.de> In a discussion about proofreading La Monte H.P. Yarroll wrote: > Guiness does not need to taste every bottle of brew to have a high > confidence that they are keeping their quality standards. Learn how to produce an ebook in one second: http://dribibu.xs4all.nl/dilbert19950628.html BTW: its "Guinness". -- Marcello Perathoner webmaster at gutenberg.org From piggy at netronome.com Wed Mar 19 12:06:17 2008 From: piggy at netronome.com (La Monte H.P. 
Yarroll) Date: Wed, 19 Mar 2008 15:06:17 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47E15D33.7080901@perathoner.de> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> <47E01719.9000307@perathoner.de> <47E02B10.7090203@netronome.com> <47E15D33.7080901@perathoner.de> Message-ID: <47E16429.1080806@netronome.com> Marcello Perathoner wrote: > In a discussion about proofreading La Monte H.P. Yarroll wrote: > > >> Guiness does not need to taste every bottle of brew to have a high >> confidence that they are keeping their quality standards. >> > > Learn how to produce an ebook in one second: > > http://dribibu.xs4all.nl/dilbert19950628.html > > > BTW: its "Guinness". > > Um, that was a deliberately inserted misprint to keep the proofreaders happy. Yep. OK, my reference to Guinness was a bit obscure. That brewing company had a trade secret method for testing properties of their product with very small samples. The trade secret was eventually lost when "A. Student" published the details. That trade secret method is a statistical test now called "Student's T". From Bowerbird at aol.com Wed Mar 19 13:43:00 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 16:43:00 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 02 Message-ID: ok, let's cut right to the chase on this parallel test... this book did the normal d.p. p1/p2/p3 workflow... then p1 was repeated, from the original o.c.r. output. now, as i reported earlier, the two parallel versions of p1 had 376 differences between them. i resolved all those, by doing a quick visual check (without referring to scans, so you can assume i made a few mistakes in there, sorry). 
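(an aside: two _independent_ proofings of the same o.c.r. also let you estimate how many errors _neither_ round caught, via the classic capture-recapture formula: if round A finds nA errors, round B finds nB, and m of those are the very same error, the lincoln-petersen estimate of the total is nA*nB/m. a minimal sketch -- the counts below are made-up illustrations, _not_ measured from this book:)

```python
# lincoln-petersen ("capture-recapture") estimate of total errors,
# given two independent proofings of the same o.c.r. text.
def lincoln_petersen(n_a, n_b, m):
    """n_a, n_b: errors found by proofings A and B; m: errors found by both."""
    if m == 0:
        raise ValueError("no overlap between rounds: estimate is unbounded")
    total = n_a * n_b / m                # estimated errors in the o.c.r.
    remaining = total - (n_a + n_b - m)  # errors that *neither* round caught
    return total, remaining

# illustrative counts only -- not data from "paul and the printing press"
total, remaining = lincoln_petersen(n_a=300, n_b=280, m=240)
print(round(total), round(remaining))  # 350 10
```

(the same formula is what an overlap count of "essentially identical changes" between p1 and p1p would feed.)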
i then compared the _second_ parallel p1 (resolved) to the _p3_ version resulting from the normal workflow, which we assume to be the most accurate text we have at this point... i've appended the mere 87 differences between the versions.

and a glance at them reveals there are some cases where the p1p (p1 parallel) version seems to be the correct one, and _not_ the p3n (p3 normal) version, which is humorous. there are also cases of meaningless linebreak differences, words that wouldn't pass spellcheck, and errors that could be easily found by using even a rudimentary clean-up tool. after we reconcile all of that, we're probably looking at about <44 errors that were left after two resolved parallel proofings.

so what can we conclude already from this experiment? _2_ parallel p1 proofings have given us results that are quite similar to _3_ rounds -- p1/p2/p3 -- in the normal workflow. ponder that one...

i have not pored over all this data to verify the accuracy, and i don't really intend to do so, because the results are clear enough to me already. p1 proofers do darn good...

-bowerbird

-----------------------------------------------------------------------

more results from the d.p. parallel proofing test:
> http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea
> http://www.pgdp.net/c/project.php?id=projectID45ca5d5645cfb

this list contrasts the p1 parallel proofing (p1p) with the output from the p3 normal workflow (p3n)... again, remember, these are _differences_ only, so the top line _or_ the bottom line, or _both_, could be the incorrect ones. note: use a fixed-point font in order to utilize the "cheater" line...

p1p) Copyright, 1920
p3n) Copyright, 1920,
===) ===============^

p1p) V PAUL GIVES THANKS FOR HIS BLESSINGS...50
p3n) V PAUL GIVES THANKS FOR HIS BLESSINGS... 50
===) ========================================^^^

p1p) XIV PAUL MAKES A PILGRIMAGE TO THE CITY...162
p3n) XIV PAUL MAKES A PILGRIMAGE TO THE CITY... 162
===) ==========================================^^^^

p1p) "Enough to till a good-sized daily, I should
p3n) "Enough to fill a good-sized daily, I should
===) ===========^================================

p1p) "Why, to print our life histories and obituaries
p3n) "Why to print our life histories and obituaries
===) ====^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) money you and you can't get any one to print
p3n) money you find you can't get any one to print
===) ==========^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) "The March Hare!" he repeated wlth enthusiasm.
p3n) "The March Hare!" he repeated with enthusiasm.
===) ===============================^==============

p1p) Birmingham's most widely circulated daily.
p3n) Burmingham's most widely circulated daily.
===) =^========================================

p1p) pay too."
p3n) pay, too."
===) ===^^^=^^^

p1p) firm of George L. Kirnball and from Dalrymple
p3n) firm of George L. Kimball and from Dalrymple
===) ====================^^^^=^^^^^^^^^^^^^^^^^^^

p1p) the Echo?"'
p3n) the Echo?"
===) ==========

p1p) "This book was illuminated, bound, and
p3n) "'This book was illuminated, bound, and
===) =^^^^^^^=^^^^^^^^=^^^^^^^^^^^^^^^^^^^^^

p1p) manuscripts, and many a one is marred by misspelling
p3n) manuscripts, and many a one is marred by mis-spelling
===) ============================================^^^^=^^^^

p1p) Mr. Cameron was as good as his word.
p3n) MR. CAMERON was as good as his word.
===) =^===^^^^^^=========================

p1p) "O.K.!" he said. "I talked with one of the
p3n) "O. K.!" he said. "I talked with one of the
===) ===^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) Caesar did in Gaul, what Cyrus and the Silician
p3n) Cæsar did in Gaul, what Cyrus and the Silician
===) =^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) New York and was, I fancy, glad to find someone
p3n) New York and was, I fancy, glad to find some
===) ============================================

p1p) who was interested and would appreciate
p3n) one who was interested and would appreciate
===) ^^^==^^=^^^^^^^^^^^^^==^^^^^^^^^^^^^^^^^^^^

p1p) I have already explained, care much for reading;
p3n) have already explained, care much for reading;
===) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^^^

p1p) ways at liberty to send contributions back with
p3n) at liberty to send contributions back with
===) ^^^^^^^^^^^^^^^^^^=^^=^^^^^=^^^^^^^^^=^^^^

p1p) smoothed away his objectious until, upon a
p3n) smoothed away his objections until, upon a
===) ==========================^===============

p1p) finer and more efiicient. It was, as Paul
p3n) finer and more efficient. It was, as Paul
===) =================^=======================

p1p) manager; the alumnae, now scattered in
p3n) manager; the alumnæ, now scattered in
===) ==================^^^^^^^^^^^=^^^^^^^

p1p) one passed through the school corridors, and
p3n) one passed through the school corridors, and `
===) ============================================^^

p1p) various sources one number after another of `
p3n) various sources one number after another of
===) ===========================================

p1p) like to write up fires and aceidents and wear a
p3n) like to write up fires and accidents and wear a
===) =============================^=================

p1p) a under the ropes."
p3n) under the ropes."
===) ^^^^^^^^^^^^^^^^^

p1p) into an incongruous garment.
p3n) into an incongruous garment. Page 74.
===) ============================^^^^^^^^^

p1p) John Gutenburg,a native of Strasburg, who
p3n) John Gutenburg, a native of Strasburg, who
===) ===============^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) was the principle of it is identical with that
p3n) was, the principle of it is identical with that
===) ===^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) "But there are short cuts," argued Mr. Cameron.
p3n) "But there are short outs," argued Mr. Cameron.
===) =====================^=========================

p1p) at all. They get a scenario or resume of the
p3n) at all. They get a scenario or résumé of the
===) ================================^===^=======

p1p) citizens can read and write, and vast is
p3n) citizens can read and write, and vast
===) =====================================

p1p) author the prey of vultures who
p3n) author was the prey of vultures who
===) =======^^^=^^=^^^^^^^^^^^^^^^^^^^^^

p1p) When the accounts were found to be short,
p3n) When the acounts were found to be short,
===) ===========^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) a patronizing scorn, For a press of the Echo's
p3n) a patronizing scorn. For a press of the Echo's
===) ===================^==========================

p1p) the contrary it naively confessed that it was
p3n) the contrary it naïvely confessed that it was
===) ==================^==========================

p1p) was no easy task. It was a thankless job, anywy
p3n) was no easy task. It was a thankless job, anyway -- the
===) ==============================================^^^^^^^^^

p1p) -- the least interesting of any of the positions
p3n) least interesting of any of the positions
===) ^^^^^^^^^^^^^^^^^^^^^^=^====^^^=^^^^^^^^^

p1p) "How is your paper coming on, Paul?," he
p3n) "How is your paper coming on, Paul?" he
===) ===================================^^^^

p1p) "B -- u -- t-" stammered Paul and then
p3n) "B -- u -- t -- " stammered Paul and then
===) ============^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) "I -- I-" faltered Paul.
p3n) "I -- I -- " faltered Paul.
===) =======^^^^^^^^^^^^^^^^^^^^

p1p) "I don't quite-"
p3n) "I don't quite -- "
===) ==============^^^^^

p1p) "We'll talk no more about this matter today,"
p3n) "We'll talk no more about this matter to-day,"
===) ========================================^^^^^^

p1p) fifty-dollar bond I have"
p3n) fifty-dollar bond I have."
===) ========================^^

p1p) Mr. Carter winked
p3n) Mr. Carter winked.
===) =================^

p1p) "I see," he said
p3n) "I see," he said.
===) ================^

p1p) the machine's myriad advantages. wasn't it
p3n) the machine's myriad advantages. Wasn't it
===) =================================^========

p1p) March Hare Would branch out and be made
p3n) March Hare would branch out and be made
===) ===========^===========================

p1p) largest industries. we cannot do without
p3n) largest industries. We cannot do without
===) ====================^===================

p1p) gig had won the election, it is true, but it had been
p3n) had won the election, it is true, but it had been
===) ^^^=^^^=^^^=^^=^^^^^^^^^^^^^^^^^^^^^^=^^^^^^=^^^^

p1p) press rooms for striking oil proof when the
p3n) press rooms for striking off proof when the
===) ==========================^^===============

p1p) Paul had had time to become really downhearted,
p3n) Paul had had time to become really down-hearted,
===) =======================================^^^^^^^^^

p1p) their days. " I'm going to take you upstairs
p3n) their days. "I'm going to take you upstairs
===) =============^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) "I See"
p3n) "I see."
===) ===^==^^

p1p) cardboard, a sort of papier-mache, and by forcing
p3n) cardboard, a sort of papier-maché, and by forcing
===) ================================^================

p1p) "I See."
p3n) "I see."
===) ===^====

p1p) however, the Boston Post ventured an innovation
p3n) however, the Boston Post ventured an innovation by
===) ===============================================^^^

p1p) by arranging its presses one over the other,
p3n) arranging its presses one over the other,
===) ^^^=^^^==^^^^^^^^^^==^^^^^^^^^^^^^^^^^^^^

p1p) duty it was to load it on to a truck, carry it up-
p3n) it duty it was to load it on to a truck, carry it up-
===) ^^^^^^^=^^^^^^=^=^^^^^=^^=^^=^^^^^^^^^^^^^^^^^=^^^^^^

p1p) cast, the half sections of stereotype were put
p3n) cast, the sections of stereotype were put
===) ==========^^^^^^^^^^^^=^^^^^=^^=^^^^==^^^

p1p) and Paul Smiled in return.
p3n) and Paul smiled in return.
===) =========^================

p1p) fine articles from parents and distant
p3n) fine articles from patents and distant
===) =====================^================

p1p) alumnae. Judge Damon had taken to contributing
p3n) alumnæ. Judge Damon had taken to contributing
===) =====^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) and two of Burminghams graduates
p3n) and two of Burmingham's graduates
===) =====================^^^^^^^^^^^^

p1p) own, was far from being the same thing as returning
p3n) own, was far from being the same thing as returning it.
===) ===================================================^^^^

p1p) it. It was strange that it should be so
p3n) It was strange that it should be so
===) ^=^^^^=^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) "Because -- well -- it would be so yellow,"
p3n) "Because -- well -- it would be so darn yellow,"
===) ===================================^^^^^^^^^^^^^

p1p) "What else could we sell it out for, fathead?"
p3n) "What else could we sell it out for, fat-head?"
===) ========================================^^^^^^^

p1p) Deeker, rolling his eyes up to the ceiling with
p3n) Decker, rolling his eyes up to the ceiling with
===) ==^============================================

p1p) with the boy?'
p3n) with the boy?
===) =============

p1p) be confessing that he had failed in his mission,
p3n) be confessing that he had failed in his mission, -- nay,
===) ================================================^^^^^^^^

p1p) -- nay, worse than that, that he had not even
p3n) worse than that, that he had not even
===) ^^^^^^^^^^^^^^=^^^^^^^^^=^^^^^^^=^^^^

p1p) only that he dreaded... The knob turned
p3n) only that he dreaded.... The knob turned
===) =======================^^^^^^^^^^^^^^^^^

p1p) hollowing them out and tilling them up again
p3n) hollowing them out and filling them up again
===) =======================^====================

p1p) wont, in unselhsh fashion, to let every one else
p3n) wont, in unselfish fashion, to let every one else
===) ==============^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) the five hundredth-time Don had been caught
p3n) the five hundredth -- time Don had been caught
===) ==================^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) me to deposit some money in the bank for him
p3n) me to deposit some money in the bank for him -- a
===) ============================================^^^^^

p1p) -- a hundred-dollar bill. I put the envelope in
p3n) hundred-dollar bill. I put the envelope in
===) ^^^^^^^^=^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^^^

p1p) In fact," he continued, lapsing into seriousness,"
p3n) In fact," he continued, lapsing into seriousness,
===) =================================================

p1p) the younger generation teaches us
p3n) "the younger generation teaches us
===) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) Carneron was a big enough man to be forgiving.
p3n) Cameron was a big enough man to be forgiving.
===) ==^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) "An honest blunder is one thing; but premeditated
p3n) "An honest blunder is one thing; but pre-meditated
===) ========================================^^^^^^^^^^

p1p) and joy to the crowning event of l920's
p3n) and joy to the crowning event of 1920's
===) =================================^=====

p1p) course, the far-tamed March Hare. Its advent
p3n) course, the far-famed March Hare. Its advent
===) ================^===========================

p1p) when weary, sleepy, but triumphant, a half
p3n) when weary, sleepy, but triumphant, a half-jubilant,
===) ==========================================^^^^^^^^^^

p1p) jubilant, half-sorrowful lot of girls and boys
p3n) half-sorrowful lot of girls and boys
===) ^^^^^^^^^^^^^^^^=^^^=^^^^^=^^^^^=^^^^

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/10252551/attachment-0001.htm

From piggy at netronome.com Wed Mar 19 14:23:20 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Wed, 19 Mar 2008 17:23:20 -0400 Subject: [gutvol-d] parallel -- paul and the printing press -- 02 In-Reply-To: References: Message-ID: <47E18448.1000708@netronome.com> Could I trouble you to calculate the number of changes which P1 and P1P made which were essentially identical? I'd like to see how well Polya's formula works. Bowerbird at aol.com wrote: > ok, let's cut right to the chase on this parallel test... > > this book did the normal d.p. p1/p2/p3 workflow... > > then p1 was repeated, from the original o.c.r. output. > > now, as i reported earlier, the two parallel versions of p1 had 376 differences between them.
i resolved all those, > by doing a quick visual check (without referring to scans, > so you can assume i made a few mistakes in there, sorry). ... From Bowerbird at aol.com Wed Mar 19 15:48:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 18:48:03 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 02 Message-ID: piggy said: > Could I trouble you to calculate the number of changes > which P1 and P1P made which were essentially identical? i'd assume you want the number of _meaningful_changes_ they made which were identical, but that takes lots of work, because one has to weed out all the meaningless changes... and, if you'd accept the meaningless changes in the count, the number quite likely runs in the _thousands_ once again, which fractures the assumptions of any statistics you'd use... once d.p. straightens out its policies to eliminate all of the meaningless changes, that data will just fall into your lap... until then, the benefit of computing it won't justify the cost. -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/e38a10c3/attachment.htm From Bowerbird at aol.com Wed Mar 19 19:04:15 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 22:04:15 EDT Subject: [gutvol-d] stopping perpetuity Message-ID: ok, we're done with iteration#5 of the perpetual proof. yay! for iteration#6, please fix the spacey ellipses on these pages: > 1 6 23 24 29 155 157 the last two -- pages 155 and 157 -- had spacey ellipses _introduced_, so let's jump on these pages quickly this time, and get them saved right, so we can pursue no-diff nirvana... but even more importantly, please pay attention to page 33! 
you will please find, on page 33, this line, and correct it as shown:

> around for a couple of weeks. Then he came, into the shop
> around for a couple of weeks. Then he came into the shop

specifically, delete the comma after the words "he came". there is no comma there, folks. never was, never will be. there _is_ an eensy-teensy-weensy little speck on the scan; but it's _so_ small, i'm not sure how o.c.r. saw it as a comma. it's so scrawny it couldn't even be considered as a _period_... let alone have the "tail" that would turn it into a _comma_, but o.c.r. put a comma there, and now it's _our_ job to take it out... this is the very last error in this book! the last one! please fix it! so remember page#33! if you're the first in on this iteration, in fact, click through until you get to page 33 and fix it _now_! then go back to fix the spacey ellipses on 1, 6, 23, 24, and 29.

page#1
> screen! Maybe meteors ... More blips--and
> fragile vehicles. Air puffed out ... and Nelson

page#6
> --the first time ..."

page#23
> the rough stuff to come, when we blast out! ... Hey, Eileen--you

page#24
> So soon ... Pop...."

page#29
> the Asteroid Belt ... Mars? That was the heebie-jeebie planet.

page#155
> " ... Frank, Gimp, Two-and-Two, Paul, Mr. Reynolds,

page#157
> can remember what's Out There ... Serene, bubb, Belt, oh yeah,

on the one on page 155, delete the space on both sides. *** all the other pages are right, so don't mess with them, at all. because if you do, you better be darn sure you've got it right. if you introduce a new error, we _will_ hunt you down, son... so anyway, have a nice day... thank you for your cooperation! book-wide proofing rules! -bowerbird p.s. we saved formatting done by f1/f2 -- thanks, chaps! -- and we will be introducing that in one book-wide operation (no fuss, no muss, no diffs, just spliffs, we flyin' now, matey!), so we're close to finishing this book! keep up the good work! p.p.s.
once we've stopped the perpetual proofing machine, we'll be able to step into the tear in the space-time curtain, and we're perched on the event horizon of the black hole... -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/29c8d751/attachment.htm From Bowerbird at aol.com Thu Mar 20 00:34:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 20 Mar 2008 03:34:03 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 03 Message-ID: i wrote most of this some time back. yes, it's still applicable. i told you there were reasons i'd tell you later. here they are. *** over on the "confidence in page" wiki-page on the d.p. wiki, in addition to the "perpetual p1" experiment i've discussed, they note the presence of two parallel proofing experiments. so... what to make of this?... first, parallel proofing works. it has an excellent track-record -- coming from the "double-punch" method of keypunchers -- and has been validated in several experiments performed by me and documented extensively, right on the forum boards at d.p. also, as i remarked a while back, the "perpetual p1" experiment gave us additional proof of the value of parallel proofing, since the regular-workflow and the p1-iterations were parallel modes which -- in combination -- found more errors than either alone. whether parallel proofing works _better_ than serial proofing is an open question. i'm not all too sure that it does, and since it involves redundant work -- i.e., having multiple proofers find and fix the same mistakes -- it doesn't appeal to me very much. however, since d.p.
is now wasting the time and energy of its volunteers in so many blatant ways, this bit of redundancy in parallel proofing pales to near insignificance in comparison... so i _might_ be interested in these tests... at the same time, the purpose of the two parallel proofing tests is unclear enough that i cannot say for certain what it might be, so that has made me fairly reluctant to even look at their data... even worse, when i saw the o.c.r., i was appalled and dismayed. all of the blank lines between paragraphs were lost in this o.c.r.! meaning that the _proofers_ had to reinsert them _manually_! in both books! amazing! that's disgusting. an error like that in the execution of the o.c.r. should be fixed by the person who _did_ the o.c.r., not proofers. but this is typical of d.p. workflow. people make bad mistakes -- mistakes which they never should have made, which would be easy for them to fix -- yet the proofers have to clean it up. i'm not saying that it's _difficult_ for the proofers to have to repair something like this. it's just putting a cursor into the right place in the text-field and then pressing the return key, repeating for as many times as necessary on any given page. so it's trivially _easy_ to correct these. but it's also _numbing_ to have to fix literally hundreds and hundreds of such errors. and it's fully _unnecessary_ to have the errors in the first place. someone has just set the options wrong on the o.c.r. interface, which means that it's also quite demeaning. frankly, an insult. and talk about "error injection"! _this_, folks, is error injection! a proofer adding some spaces around an ellipse? aw-psshaw. that's kid-stuff. how much damage can you really do that way? but when that one checkbox in the o.c.r. settings box was wrong, literally hundreds and hundreds of _errors_ were _injected_ into this file. hundreds and hundreds! _that's_ how you "inject errors", my boy. 
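(and note: a blunder like this is trivially machine-detectable _before_ a single page reaches a proofer. a little pre-flight check along these lines -- the threshold and the api are my own assumptions, not anything d.p. actually runs -- would have flagged both books instantly:)

```python
# pre-flight sanity check: flag o.c.r. output whose blank lines
# (the paragraph breaks) were lost to a bad settings checkbox.
def paragraph_breaks_suspicious(ocr_text, min_ratio=0.01):
    """True if the share of blank lines is implausibly low for book text."""
    lines = ocr_text.splitlines()
    if not lines:
        return True
    blanks = sum(1 for line in lines if not line.strip())
    return blanks / len(lines) < min_ratio

good = "para one, line 1\npara one, line 2\n\npara two, line 1\n"
bad = "para one, line 1\npara one, line 2\npara two, line 1\n"
print(paragraph_breaks_suspicious(good), paragraph_breaks_suspicious(bad))  # False True
```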
as i said earlier, we need to determine _who_ did this and take them aside for a little chat where we will kindly instruct them what they did wrong, and have them promise to never do it again. because what happened here is inexcusable. and the error should've been corrected before _any_ of this text went in front of one single proofer. not one page. not one proofer. because this is simply unacceptable... un-ac-cep-ta-ble. totally... it shows extreme disrespect for the time and energy that are being _donated_ to the cause of _the_digitization_ of the _public_domain_. but wait... because the problem gets even worse... not only was the paragraphing lost in this o.c.r., due to carelessness, but also the o.c.r. itself shows page after page of _systematic_defects_. for the first parallel proofing experiment, paul and the printing press, there are _44_ pages that demonstrate clipping of one side of the text: > 16 18 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 > 52 56 72 76 78 80 82 84 86 88 90 96 98 100 166 170 > 172 174 178 180 188 190 192 194 196 198 200 the other parallel proofing book christopher and the clockmakers, is even worse, with some _56_ pages with text that is badly clipped: > 20 36 38 44 88 90 92 98 126 128 130 132 134 136 140 142 > 144 146 148 150 152 154 156 158 160 162 164 166 168 > 170 172 174 176 178 180 188 194 200 202 204 206 208 212 > 214 216 218 220 222 224 226 228 230 234 236 240 246 i've posted text from the "paul" pages, to show how terrible they are: > http://z-m-l.com/misc/paul_bad_pages.html in addition, i've appended one page of this text to this post for you... proofers had to first erase the junk, then type in the text from the scan, for dozens and dozens of lines, on dozens and dozens of pages. wow. i don't know about you, but _i_ would be downright embarrassed to even _show_ that o.c.r. to another person, let alone ask them to fix it. 
honestly, i don't know if it's so bad because the _scans_ were clipped, or whether the "scanning zones" were incorrectly set in the o.c.r. app. but whichever it was, _someone_ should have fixed that problem first, instead of just shrugging the shoulders and passing through bull crap for someone further down the line to clean up. this is just _disgusting_. and let me say it a third time, so it really sinks in. this is _disgusting_... so -- just like "planet strappers" -- where an incompetent human made a bad mistake by changing all of the em-dashes into en-dashes instead, leaving the poor proofers to _manually_ change 1,137 back, one at a time, here too (in two books!) an incompetent content provider has caused grief. whoever that "someone" might be, they should be ashamed of themselves. and it's not like i _picked_ these examples, due to some "agenda" of mine. all this research was conceived and conducted by d.p. people themselves, who seemingly have become immune to their incompetence, and consider themselves to be "justified" when they get angry at people who point it out. and it's not some "fluke" that these books were badly flawed either. the truth is, i've examined lots and lots of d.p. books at the various stages, and the vast majority of them are flawed in significant and pervasive ways, and these flaws dump tons and tons of unnecessary work on volunteers... the incompetence shown over there is -- only one word for it -- stunning. -bowerbird p.s. here's one page of the flawed text from "paul and the printing press". this book appears to have junk in the margin, rather than pages clipped, but the result is the same -- unnecessary work for the proofers to correct. -> p#036 ft r. So you've come to explore the repairing de- |artment, have you? The informality of the greeting was delightful (ho Christopher, and immediately his heart went out |gm the old Scotchman. |~(|V " I guess so, yes," smiled he. " I didn't know I |{was going to though. It just happened." 
V:'| " It's not a bad happen, perhaps. Make your- |jself at home, laddie. Here's a stool." | ~" I'd rather stand and watch you." |V` " But I sha'n't let you. It makes me nervous to V |{have somebody hanging over my shoulder and |jmaybe jogging my elbow. If you're to stay you |(must sit," was the brusque but not unkindly x |fanswer. (g41;t Somewhat crestfallen the boy slipped to the |(gotool and for a few moments remained immovable, |'Watching the workman's busy fingers. How care- xgsfully they moved--with what fascinating deftness |and rapidity! ~.^J*| " I see you are not one to keep hitching and |Jtwiddling around," the clockmaker presently re- t |arked, with a twinkle. " We shall get on | = ously together. I detest nervous people." A gig| " Are you fixing the clock Mr. Bailey was ask- |;-~|g about?" Christopher ventured. |sum;" Not just now, sonny. I am finishing up a | job. I shall go back to her in a minute, `;|Tjtowever. You can't just tinker her at will as you | common clocks. She has to be dreamed over." V it| " Dreamed over!" repeated Christopher, not a ` `ijl|attle puzzled. | ' "Aye, dreamed over! Well-nigh prayed over -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/7be9f41e/attachment-0001.htm From richfield at telkomsa.net Wed Mar 19 13:15:13 2008 From: richfield at telkomsa.net (Jon Richfield) Date: Wed, 19 Mar 2008 22:15:13 +0200 Subject: [gutvol-d] Gothic or Gothic? Thanks folks! Message-ID: <47E17451.9080401@telkomsa.net> I wrote asking for advice on scanning Fraktur. (Absent-mindedly claiming to have Omniscan, which afaik is the one universal and error-free scanning software package; I meant of course Omnipage, which is not.
(Not bad actually, but...)) Steven desJardins wrote: >I (Jon) had asked: >> BTW, just as a matter of curiosity, what is the copyright situation with >> Hitler's works? I know that it has lapsed in Australia and presumably >> Canada, but it should nominally be in copyright in the US. Is it >> regarded as such, and if so, is it an academic question, or would it be >> enforced, and if so , by whom? Steve replied: According to Wikipedia, "The U.S. government seized the copyright during the Second World War as part of the Trading with the Enemy Act and in 1979, Houghton Mifflin, the U.S. publisher of the book, bought the rights from the government. " < Thanks Steve. I cannot help wondering whether HM made any profit on the deal; I seldom see a copy. ================ Robert Cicconetti replied: >Fraktur fonts are difficult to OCR well; I have not tried in a while, but I understand older versions of OCR software actually do better (for Finereader, it was v5 or v6; can't recall) as they make fewer assumptions about the typeface. There has also been some work done on the open-source OCR engine Tesseract by piggy, a member of DP; I have not used it myself so I cannot comment on how well it works as yet. I can say that I spent many hours trying to train FR7 to understand Fraktur and other blackletter fonts, and got absolutely nowhere.< Thanks Robert, That sounds like an admonition not to be in too big a hurry. Fortunately I have more on my fork at the moment than I can manage! =================== La Monte H.P. Yarroll wrote: > The OCR package tesseract now has usable fraktur support. You want to use the deu-f language package. If you find pages that don't OCR well, send them to me and I'll fix the tesseract training to work better with them.< Much thanks sir. I am archiving this note of course, and if we both survive long enough, I'll take you up on your helpfulness! ================== To all: I much appreciate your responses. 
Go well and thank you, Jon From piggy at netronome.com Thu Mar 20 06:47:20 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 20 Mar 2008 09:47:20 -0400 Subject: [gutvol-d] parallel -- paul and the printing press -- 02 In-Reply-To: References: Message-ID: <47E26AE8.9040404@netronome.com> Bowerbird at aol.com wrote: > piggy said: > > Could I trouble you to calculate the number of changes > > which P1 and P1P made which were essentially identical? > > i'd assume you want the number of _meaningful_changes_ > they made which were identical, but that takes lots of work, > because one has to weed out all the meaningless changes... You are correct in surmising that I'm interested in your "real errors" metric. > > and, if you'd accept the meaningless changes in the count, > the number quite likely runs in the _thousands_ once again, > which fractures the assumptions of any statistics you'd use... I think the wdiff alterations metric (changed + inserted + deleted as reported by wdiff -s) would be interesting and potentially useful. Presumably the ratio of "meaningless changes" to "real errors" is fairly consistent between the parallel rounds. > > once d.p. straightens out its policies to eliminate all of the > meaningless changes, that data will just fall into your lap... My focus is in devising an automated metric which can ignore most of the "meaningless changes". I think you will agree that automation is much easier to deploy than social engineering. That doesn't mean it isn't worth doing the social engineering, but technological changes have substantially lower inertia than social changes. > > until then, the benefit of computing it won't justify the cost. Have you played with ocrdiff? I think it has a really good chance of approximating your metric well enough to be usable. Even if you use a different starting point, I am very interested in a fully automated tool that can approximate your metric. 
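The wdiff alterations metric piggy describes (changed + inserted + deleted, as reported by wdiff -s) is easy to approximate in a few lines. Here is a minimal sketch using Python's difflib in place of wdiff itself; the sample page strings are hypothetical, and difflib's matching will not agree with wdiff word-for-word, but the count it produces is the same kind of number:

```python
# Sketch: approximate a wdiff-style alterations count
# (changed + inserted + deleted words) between two versions
# of a page. difflib stands in for wdiff here; the sample
# strings below are hypothetical.
import difflib

def alterations(old_text, new_text):
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words,
                                      autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            # a changed run counts once per word on the longer side
            total += max(i2 - i1, j2 - j1)
        elif tag == "delete":
            total += i2 - i1
        elif tag == "insert":
            total += j2 - j1
    return total

ocr_page = "So you 're Paul Cameron. I 've had deal-"
p1_page = "\"So you're Paul Cameron. I've had dealings"
print(alterations(ocr_page, p1_page))
```

Counting words rather than lines keeps the number comparable across rounds, though it still lumps meaningless changes in with real errors, which is exactly the ratio question raised above.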
If you feel that you have invested as much into this project as you find necessary, I'll understand. I certainly wish to thank you for your contributions to date. > > -bowerbird > From piggy at netronome.com Thu Mar 20 06:58:08 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 20 Mar 2008 09:58:08 -0400 Subject: [gutvol-d] stopping perpetuity In-Reply-To: References: Message-ID: <47E26D70.5090604@netronome.com> Bowerbird at aol.com wrote: > ok, we're done with iteration#5 of the perpetual proof. yay! > Wow! That was really fast. Hmm... A quick analysis shows that about half the work was done by a single relatively new proofer. This newby went from about 40 pages to over 130 working on this project. They were also spending something like 10-15 seconds per page rather than the 2-5 minutes per page that everybody else was applying. I think this may skew the results of this round a little. I'll PM them to thank them for their enthusiasm but to request closer attention next round. From piggy at netronome.com Thu Mar 20 07:09:39 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 20 Mar 2008 10:09:39 -0400 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 In-Reply-To: References: Message-ID: <47E27023.1020208@netronome.com> Bowerbird at aol.com wrote: > i wrote most of this some time back. yes, it's still applicable. > > i told you there were reasons i'd tell you later. here they are. > ... > p.s. here's one page of the flawed text from "paul and the printing > press". > this book appears to have junk in the margin, rather than pages clipped, > but the result is the same -- unnecessary work for the proofers to > correct. > > -> p#036 > ft r. So you've come to explore the repairing de- > |artment, have you? > The informality of the greeting was delightful > (ho Christopher, and immediately his heart went out > |gm the old Scotchman. > |~(|V " I guess so, yes," smiled he. " I didn't know I > |{was going to though. 
It just happened." > V:'| " It's not a bad happen, perhaps. Make your- > |jself at home, laddie. Here's a stool." > | ~" I'd rather stand and watch you." > |V` " But I sha'n't let you. It makes me nervous to > V |{have somebody hanging over my shoulder and > |jmaybe jogging my elbow. If you're to stay you > |(must sit," was the brusque but not unkindly > x |fanswer. > (g41;t Somewhat crestfallen the boy slipped to the > |(gotool and for a few moments remained immovable, > |'Watching the workman's busy fingers. How care- > xgsfully they moved--with what fascinating deftness > |and rapidity! > ~.^J*| " I see you are not one to keep hitching and > |Jtwiddling around," the clockmaker presently re- > t |arked, with a twinkle. " We shall get on > | = ously together. I detest nervous people." > A gig| " Are you fixing the clock Mr. Bailey was ask- > |;-~|g about?" Christopher ventured. > |sum;" Not just now, sonny. I am finishing up a > | job. I shall go back to her in a minute, > `;|Tjtowever. You can't just tinker her at will as you > | common clocks. She has to be dreamed over." > V it| " Dreamed over!" repeated Christopher, not a > ` `ijl|attle puzzled. > | ' "Aye, dreamed over! Well-nigh prayed over If you would care to implement a gutter noise removal algorithm for tesseract, I would certainly be happy to see the contribution. Most libraries I borrow books from are not willing to let me cut their books up so that I can get perfectly flat scans. Your skills at alienating your advocates continue to impress me. From piggy at netronome.com Thu Mar 20 07:14:06 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 20 Mar 2008 10:14:06 -0400 Subject: [gutvol-d] Gothic or Gothic? Thanks folks! In-Reply-To: <47E17451.9080401@telkomsa.net> References: <47E17451.9080401@telkomsa.net> Message-ID: <47E2712E.5030609@netronome.com> Jon Richfield wrote: > I wrote asking for advice on scanning Fraktur. 
(Absent-mindedly > claiming to have Omniscan, which afaik is the one universal and > error-free scanning software package; I meant of course Omnipage, which > is not. (Not bad actually, but...)) > ... I'm actively interested in improving tesseract OCR's fraktur performance. Could you point me at a few of your pages? I'll let you know how well we're doing with the limited fraktur training we have so far. From rolsch at verizon.net Thu Mar 20 08:17:29 2008 From: rolsch at verizon.net (Roland Schlenker) Date: Thu, 20 Mar 2008 10:17:29 -0500 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 In-Reply-To: <47E27023.1020208@netronome.com> References: <47E27023.1020208@netronome.com> Message-ID: <200803201117.29396.rolsch@verizon.net> On Thursday 20 March 2008 10:09:39 am La Monte H.P. Yarroll wrote: > > If you would care to implement a gutter noise removal algorithm for > tesseract, I would certainly be happy to see the contribution. Most > libraries I borrow books from are not willing to let me cut their books > up so that I can get perfectly flat scans. If you use the program "unpaper" before OCR'ing. Most of the gutter noise is removed. Roland Schlenker From Bowerbird at aol.com Thu Mar 20 11:17:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 20 Mar 2008 14:17:42 EDT Subject: [gutvol-d] stopping perpetuity Message-ID: piggy said: > I think this may skew the results of this round a little. you look very very closely and let me know if it does... :+) -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/9a5a8494/attachment.htm From nwolcott2ster at gmail.com Thu Mar 20 11:38:23 2008 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Thu, 20 Mar 2008 13:38:23 -0500 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 References: <47E27023.1020208@netronome.com> <200803201117.29396.rolsch@verizon.net> Message-ID: <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> Is unpaper available for us windows users? nwolcott2 at post.harvard.edu ----- Original Message ----- From: "Roland Schlenker" To: "Project Gutenberg Volunteer Discussion" Sent: Thursday, March 20, 2008 10:17 AM Subject: Re: [gutvol-d] parallel -- paul and the printing press -- 03 > On Thursday 20 March 2008 10:09:39 am La Monte H.P. Yarroll wrote: > > > > If you would care to implement a gutter noise removal algorithm for > > tesseract, I would certainly be happy to see the contribution. Most > > libraries I borrow books from are not willing to let me cut their books > > up so that I can get perfectly flat scans. > > If you use the program "unpaper" before OCR'ing. Most of the gutter noise is > removed. > > Roland Schlenker > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From grythumn at gmail.com Thu Mar 20 11:46:46 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Thu, 20 Mar 2008 14:46:46 -0400 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 In-Reply-To: <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> References: <47E27023.1020208@netronome.com> <200803201117.29396.rolsch@verizon.net> <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> Message-ID: <15cfa2a50803201146k11e09a9aj1255f633d415d909@mail.gmail.com> It's easy enough to build binaries (I use cygwin). I don't know if anyone has packaged them or built a gui around it. I do recommend turning down the defaults a bit.. 
I think it was tuned for processing scanned photocopies, and it is rather overaggressive on my scans. Bob On Thu, Mar 20, 2008 at 2:38 PM, Norm Wolcott wrote: > Is unpaper available for us windows users? > > nwolcott2 at post.harvard.edu > ----- Original Message ----- > From: "Roland Schlenker" > To: "Project Gutenberg Volunteer Discussion" > Sent: Thursday, March 20, 2008 10:17 AM > Subject: Re: [gutvol-d] parallel -- paul and the printing press -- 03 > > > > On Thursday 20 March 2008 10:09:39 am La Monte H.P. Yarroll wrote: > > > > > > If you would care to implement a gutter noise removal algorithm for > > > tesseract, I would certainly be happy to see the contribution. Most > > > libraries I borrow books from are not willing to let me cut their > books > > > up so that I can get perfectly flat scans. > > > > If you use the program "unpaper" before OCR'ing. Most of the gutter > noise > is > > removed. > > > > Roland Schlenker > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/d3c9fdd9/attachment-0001.htm From Bowerbird at aol.com Thu Mar 20 11:50:11 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 20 Mar 2008 14:50:11 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 03 Message-ID: piggy said: > If you would care to implement a gutter noise removal algorithm > for tesseract, I would certainly be happy to see the contribution. oh please no. you're using _tesseract_ to o.c.r. books? why in the world would you use a _beta_ o.c.r. program? because it didn't cost you anything? it's costing your volunteer proofers _lots_ of time and energy, time and energy they donate, in good faith, to a good cause... when you treat these people like guinea piggies, they will leave, and never come back again. is that really what you want to do? even if it's what you want to do, do you have the _right_ to do it? the d.p. 
retention rate sucks, in spite of the cultish "friendliness". maybe some people might wonder why, but i'm not one of them. so, what to do? there are plenty of d.p. volunteers who have abbyy finereader -- the acknowledged leader in accuracy -- so have _them_ do the scanning if that's what it takes to get decent o.c.r. output... besides, the loss of paragraphing, which was a fault noticeable in 10 seconds, is not something tesseract always causes, is it? so that was an _operator_ mistake, error injection at its finest. (and if tesseract _does_ always lose the paragraphing, then that's even _more_ reason why nobody should use it at d.p.) > Most libraries I borrow books from are > not willing to let me cut their books up > so that I can get perfectly flat scans. oh please. just because you're off chasing after the broom of the wicked witch of the west doesn't mean you can bring mr. strawman into the argument. you don't need "perfectly" flat scans. you just need scans that don't have a ton of noise. if the scans were bad for these books, then find better scans! or do consultation with the d.p. image-manipulation experts. or find another d.p. volunteer who can _create_ better scans... but do _not_ just accept the bad scans and dump awful o.c.r. on the proofers, and expect them to essentially do a type-in. because that is _disgusting_. and hey, i'm really sorry if that hurts your feelings, but it's the truth, and you need to know it. and lots and lots of _other_ d.p. content providers need to be confronted with the truth of the incompetence of their efforts. > You skills at alienating your advocates continue to impress me. with advocates like this, who needs detractors? i have _truth_ on my side. and _common_sense_. and tons and tons of data i can display any time... and more lurkers who will step out from the shadows in support of me if i ask them than you might expect... which is _not_ to say that i would reject any "advocate". 
but if you really wanna be an advocate of mine, then you'd better understand that point #1 on my plan is to get good scans, and point #2 is to do quality o.c.r. there are 8 points after that, but start with #1 and #2. -bowerbird > just to remind them, off the dome, this is what needs to be done: > 1. ensure you have decent scans, and name them intelligently. > 2. use a decent o.c.r. program, and ensure quality results. > 3. do not tolerate bad text handling by content providers. > 4. do a decent post-o.c.r. cleanup, before _any_ proofing. > 5. retain linebreaks (don't rejoin hyphenates or clothe em-dashes). > 6. change the ridiculous ellipse policy to something sensible. > 7. stop doing small-cap markup with no semantic meaning. > 8. i forget what 8 was for. > 9. retain pagenumber information, in an unobtrusive manner. > 10. format the ascii version using light markup, for auto-html. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/228da4af/attachment.htm From Bowerbird at aol.com Thu Mar 20 13:25:34 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 20 Mar 2008 16:25:34 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 02 Message-ID: first of all, i made a big mistake yesterday... i put on a webpage the o.c.r. from the bad pages of "paul and the printing press". or so i thought. in actuality, i pulled pages from the wrong book -- the "christopher and the clockmakers" book which is the second "parallel" experiment text... so... if you took a look at that o.c.r. and thought "i dunno, it doesn't look all _that_ bad to me...", then you might want to go take another gander: > 
http://z-m-l.com/misc/paul_bad_pages.html also, as i intended to do in my previous message, i have appended one page of o.c.r. -- page 36 -- from the "paul and the printing press" book, and i've now also included the text as it was corrected, so you get the full flavor of how bad the o.c.r. is... (because of _human_error_ while doing the o.c.r.) to repeat, i'd be embarrassed to show this to people, let alone actually shovel it to volunteers to _correct_. oh, and remember, because this was research on _parallel_proofing_, p1 proofers were subjected to this onerous o.c.r. _twice_, which is a real travesty. *** piggy said: > You are correct in surmising that > I'm interested in your "real errors" metric. as i told you, weed out the meaningless stuff, and whatever you have left are "the real errors". i didn't do this for "paul and the printing press". the differences i showed were just _differences_, i didn't present them as "errors". however, when you see the paired lines, it's usually fairly easy to see which of those two is the one that is in error. (for the record, i was able to identify the bad line successfully on 77 of the 87 line-pairs i'd listed. and this file, with its cut-off lines, was really hard.) > I think the wdiff alterations metric > (changed + inserted + deleted as reported by wdiff -s) > would be interesting and potentially useful. yeah, right. on the garbage o.c.r. you have in the parallel tests, you aren't gonna find _any_ statistic that's "useful". garbage-in-garbage-out. it's a law you can't break. > Presumably the ratio of "meaningless changes" > to "real errors" is fairly consistent > between the parallel rounds. get real. you cannot even reliably _count_ the number of meaningless changes when you have garbage o.c.r. i've appended the o.c.r. of one page, and the p3 output. ask ten people to count the "meaningful" changes and you'll get ten different answers. 
and sure, you _could_ settle onto one metric, and use just that, but you won't get any predictive power out of it, not in the big picture. and when you throw in the d.p. policy meaninglessness, things like rejoining hyphenates and clothing em-dashes and that rot, you're just piling on more ridiculousness... if you get anything out of that mess, god blessed you... but strip away that nonsense, and things are crystal clear. with good scans and good o.c.r., pages have -- at most -- a half-dozen errors, and the p1 proofers get most of them, and subsequent rounds whittle 'em away until there are 0. real errors get found, and they get fixed. and that's it... it happens on page after page, day after day, over at d.p. > My focus is in devising an automated metric > which can ignore most of the "meaningless changes". good luck with that. > I think you will agree that automation is > much easier to deploy than social engineering. well, if -- by "social engineering" -- you mean "convincing" the-powers-that-be (as they humorously call themselves) over at d.p. to change their evil ways, well then maybe just maybe you will find it easier to develop an automated metric. but you'll be a flying piggy by that time, and will most likely find it more fulfilling to be flitting among the clouds instead. > Have you played with ocrdiff? don't know what it is, and probably don't much care. since i don't allow any meaningless noise to get inside my data in the first place, i have no need to tune it out. good data just falls into my lap. yes, it really is that simple. > I think it has a really good chance of > approximating your metric well enough to be usable. yeah, well, you let me know when that happens. > Even if you use a different starting point, I am very interested > in a fully automated tool that can approximate your metric. i'm curious why you keep calling it a "metric"... as if it were some kind of stand-in variable for o.c.r. errors. it's not. i simply locate the o.c.r. 
errors, and i count them... it's not hard to locate the o.c.r. errors. it's ridiculously easy... you just take the text as it's been proofed as close to perfection as you can get it, and then you compare it to the original o.c.r., and the places where the lines differ might be the o.c.r. errors... (they might also be places where the transcriber made a change.) note, as i've remarked before, if you go look at the page-images, you'll often find that the o.c.r. didn't really make an "error" per se, but (for example) recognized a speck as a period, or a comma, or made some other recognition decision that's fully understandable. oh, it makes _actual_errors_ too -- sometimes inexplicable ones -- but for the most part, it's usually easy to see why it did what it did... *** and certainly -- in the case of the page i have appended below -- you can understand why i'm reluctant to call those "o.c.r. errors"... no siree, those "errors" are due to what we programmers label as p.e.b.k.a.c. -- a.k.a., "problem exists between keyboard and chair". -bowerbird p.s. here's the o.c.r. from page 36 of "paul and the printing press", followed by the text as it was proofed by the p1-p2-p3 workflow... -> p#036 In spite of Paul's optimism he was more than of Melvil1e's opinion. g g_: Mr. Carter was well known throughout ingharn as a stern, austere man whom = le feared rather than loved. He had the of being shrewd, closefsted, and at a bargain,-a person of few friends g-gig? many enemies. I-Ie was a great lighter, t; ng a grudge to any length for the sheer Rk ure of gratifying it. Therefore many a re mature and courageous promoter than ii Cameron had shrunk from approaching with a business proposition. Even Paul did not at all relish the mission be if ore him; he was, however, too manly to shirk ^Rl Hence that evening, directly after dinner, fjmade his way to the mansion of Mr. Arthur ` by Carter, the wealthy owner of the Echo, jirmingham's most widely circulated daily. 
fggortunately or unfortunately-Paul was in which-the capitalist was at home at leisure; and with beating heart the boy T;' ushered into the presence of this illustrious eman. Carter greeted him politely but with no ixdiality. So you 're Paul Cameron. I 've had deal- Jl;T t with your father," he remarked dryly. $@1 A = t can I do for you?" Iii courage ebbed. The question was j= = and direct, demanding a reply of similar In spite of Paul's optimism he was more than half of Melville's opinion. Mr. Carter was well known throughout Burmingham as a stern, austere man whom people feared rather than loved. He had the reputation of being shrewd, close-fisted, and sharp at a bargain,--a person of few friends and many enemies. He was a great fighter, carrying a grudge to any length for the sheer pleasure of gratifying it. Therefore many a more mature and courageous promoter than Paul Cameron had shrunk from approaching him with a business proposition. Even Paul did not at all relish the mission before him; he was, however, too manly to shirk it. Hence that evening, directly after dinner, he made his way to the mansion of Mr. Arthur Presby Carter, the wealthy owner of the Echo, Burmingham's most widely circulated daily. Fortunately or unfortunately--Paul was uncertain which--the capitalist was at home and at leisure; and with beating heart the boy was ushered into the presence of this illustrious gentleman. Mr. Carter greeted him politely but with no cordiality. "So you're Paul Cameron. I've had dealings with your father," he remarked dryly. "What can I do for you?" Paul's courage ebbed. The question was crisp and direct, demanding a reply of similar ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/4e58af3f/attachment.htm From donovan at abs.net Thu Mar 20 15:27:44 2008 From: donovan at abs.net (D Garcia) Date: Thu, 20 Mar 2008 18:27:44 -0400 Subject: [gutvol-d] Unpaper for DOS/Windows (WAS: Re: parallel -- paul and the printing press -- 03) In-Reply-To: <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> References: <200803201117.29396.rolsch@verizon.net> <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> Message-ID: <200803201827.44543.donovan@abs.net> On Thursday 20 March 2008 14:38, Norm Wolcott wrote: > Is unpaper available for us windows users? Back in 2006 I built a standalone DOS executable of it for the Windows folks. zip file here: http://www.pgdp.org/~donovan/unpaper-0.2.zip That is probably not up to date with the current source, but if you have the lcc-win32 or other compiler (not cygwin, you can get into dll-hell going that route) you can always rebuild from the current source. From gbnewby at pglaf.org Thu Mar 20 18:25:13 2008 From: gbnewby at pglaf.org (Greg Newby) Date: Thu, 20 Mar 2008 18:25:13 -0700 Subject: [gutvol-d] Moderation/censorship Message-ID: <20080321012513.GB22705@mail.pglaf.org> I received a request to moderate or otherwise quiet Bowerbird on gutvol-d. This request was based on an opinion that he has been behaving poorly. Unfortunately I somehow deleted the message, so am not certain who it was from. Therefore, I'm responding here: The answer is: no, I will not turn on moderation or remove list members at this time. This topic has been hashed over several times in the past, so I'm not going to try to retype the history... maybe some other people would like to. The bottom line is that IF you want a moderated list, put together a *team* of moderators and we'll make a moderated list. I'm personally unwilling to take on that responsibility. (The Project Wombat lists are good examples of multiple lists with different levels of moderation.) 
I'm happy to set up additional mailing lists. The moderated list could, in one scenario, consist mostly of filtered postings from gutvol-d. It's up to the moderator team. I do insist that any list, moderated or not, be open to any and all subscribers. But in a moderated list, the moderators decide which messages go to the list, and which ones do not. People who do not like the moderation policy or practice are, of course, welcome to start their own list. Because I have not been following the threads closely, I am not going to offer an opinion on appropriate versus inappropriate behavior. It seems there has been plenty of rough talk, though, which I simply delete. As always, I encourage people to make maximum use of their email systems to block individuals, threads, subjects, etc. which they would rather not see. -- Greg From piggy at netronome.com Thu Mar 20 18:26:59 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 20 Mar 2008 21:26:59 -0400 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 In-Reply-To: <15cfa2a50803201146k11e09a9aj1255f633d415d909@mail.gmail.com> References: <47E27023.1020208@netronome.com> <200803201117.29396.rolsch@verizon.net> <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> <15cfa2a50803201146k11e09a9aj1255f633d415d909@mail.gmail.com> Message-ID: <47E30EE3.2040208@netronome.com> Robert Cicconetti wrote: > It's easy enough to build binaries (I use cygwin). I don't know if > anyone has packaged them or built a gui around it. > > I do recommend turning down the defaults a bit.. I think it was tuned > for processing scanned photocopies, and it is rather overaggressive on > my scans. I have been limiting its use to very narrow cases because of how aggressive it is. What settings do you find best for 8-bit grayscale documents scanned from original books? I have not found settings I have been happy with. > > Bob > > On Thu, Mar 20, 2008 at 2:38 PM, Norm Wolcott > wrote: > > Is unpaper available for us windows users? 
> > nwolcott2 at post.harvard.edu > ----- Original Message ----- > From: "Roland Schlenker" > > To: "Project Gutenberg Volunteer Discussion" > > > Sent: Thursday, March 20, 2008 10:17 AM > Subject: Re: [gutvol-d] parallel -- paul and the printing press -- 03 > > > > On Thursday 20 March 2008 10:09:39 am La Monte H.P. Yarroll wrote: > > > > > > If you would care to implement a gutter noise removal > algorithm for > > > tesseract, I would certainly be happy to see the contribution. > Most > > > libraries I borrow books from are not willing to let me cut > their books > > > up so that I can get perfectly flat scans. > > > > If you use the program "unpaper" before OCR'ing. Most of the > gutter noise > is > > removed. > > > > Roland Schlenker > > > ------------------------------------------------------------------------ > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Thu Mar 20 18:51:14 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 20 Mar 2008 21:51:14 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 05 Message-ID: a "metric"? you want a "metric"? i can give you a "metric". one that's so easy to compute a computer can even do it. i'm not gonna process the numbers for the whole book, but i think it's fine to look at a single page for education. so let's look at page 36 from "paul and the printing press". there are 31 lines on this page. thirty-one. 31. four versions of the page's text -- o.c.r., p1, p2, and p3 -- are presented below, for your enjoyment and edification... refer to the versions frequently, as need be, as you follow along with this brief analysis of the life of this little page... after o.c.r., the page went through the normal p1-p2-p3, so i have used these labels herein: ocr, p1n, p2n, and p3n. *** o.c.r. got 1 line right... out of 31 lines... pathetic... 
for the record: > In spite of Paul's optimism he was more than *** p1n fixed 28 of the 30 error-ridden lines correctly, except 2: ocr> of being shrewd, closefsted, and p1n> of being shrewd, close-fisted, and <- fixed scanno, but... p2n> reputation of being shrewd, close-fisted, and <-bingo p3n> reputation of being shrewd, close-fisted, and in this first case, p1n fixed the scanno that was there, but also missed the fact that a (big) word had been completely cut off from the text. missing words can be difficult to spot, which is why you want to make sure o.c.r. doesn't cut any off. also, missing words are extremely hard to catch automatically. p2 came along and fixed this error. thank you. the second line where p1 missed an error was this: ocr> So you 're Paul Cameron. I 've had deal- p1n> So you're Paul Cameron. I've had dealings <- rejoined, but p2n> So you're Paul Cameron. I've had dealings <- missed one... p3n> "So you're Paul Cameron. I've had dealings <-bingo in this second case, p1n rejoined the end-line-hyphenate fine, but missed that an opening quotemark was absent at the start of the line. likewise, p2 missed it too. p3 fixed this error, so it was a persistent one. however, this kind of error is _easy_ to spot via automated analysis -- a simple routine detects unbalanced quotemarks in a paragraph -- so we don't have to take it seriously. so even the two errors that "slipped by" could well have been avoided. good scanning would have prevented that word from being chopped, and a good clean-up tool would have _alerted_ us to the quotemark... all in all, a _fantastic_ job by p1. but boy wasn't that o.c.r. awful? my goodness! awful! p1 had to fix 30 out of 31 lines! that's 96.77% _bad_. phew! there's your metric, the percentage of lines changed in p1. you told everyone you wanted a metric. there's your metric. *** p2n corrected 1 of the 2 errors with which it was faced. well, it could have done better. and it could have done worse. 
not much more that you can say after that. *** p3n corrected the 1 error with which it was faced... (if there are more errors on the page -- it's possible -- then all of the normal p1-p2-p3 rounds missed them.) *** ok, so i'm just gonna throw out a round guesstimate and say that p1 made 98 corrections... and we know p2 and p3 made 1 each... so p1 had an accuracy-rate of 98%, p2 had a rate of 50%, and p3 -- by definition -- had an accuracy-rate of 100%. how 'bout those p1 proofers? really something, aren't they? that's the trend you always see. p1 is big, then whittle away. and like i said, they do it day in and day out, on page after page. they rock... *** ok, so we did that little exercise on one page. but you can do it on _all_ of the pages if you want to, because it's really very easy. just go to the project page for this book: > http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea make sure you've chosen "detail level 4", at the top of the page, and then go down lower to follow the _progress_ of each page... click on the "diff" between o.c.r. and p1 to see the changes made. then click the "diff" between p1 and p2 to see the changes made. and finally click the "diff" between p2 and p3 to see the changes. a "no diff" between two spots means that no changes were made. and you'll see it time after time. p1 is big, p2 and p3 whittle away. -bowerbird ============================================ the text for page 36 from "paul and the printing press" experiment for page 36. (actually file#36, which is d.p. lingo for, um, page 19.) the text is given from o.c.r., after p1n, after p2n, and then after p3n... ============================================ -> p#036 -- ocr In spite of Paul's optimism he was more than of Melvil1e's opinion. g g_: Mr. Carter was well known throughout ingharn as a stern, austere man whom = le feared rather than loved. He had the of being shrewd, closefsted, and at a bargain,-a person of few friends g-gig? many enemies. 
I-Ie was a great lighter, t; ng a grudge to any length for the sheer Rk ure of gratifying it. Therefore many a re mature and courageous promoter than ii Cameron had shrunk from approaching with a business proposition. Even Paul did not at all relish the mission be if ore him; he was, however, too manly to shirk ^Rl Hence that evening, directly after dinner, fjmade his way to the mansion of Mr. Arthur ` by Carter, the wealthy owner of the Echo, jirmingham's most widely circulated daily. fggortunately or unfortunately-Paul was in which-the capitalist was at home at leisure; and with beating heart the boy T;' ushered into the presence of this illustrious eman. Carter greeted him politely but with no ixdiality. So you 're Paul Cameron. I 've had deal- Jl;T t with your father," he remarked dryly. $@1 A = t can I do for you?" Iii courage ebbed. The question was j= = and direct, demanding a reply of similar -> p#036 -- p1n In spite of Paul's optimism he was more than half of Melville's opinion. Mr. Carter was well known throughout Burmingham as a stern, austere man whom people feared rather than loved. He had the of being shrewd, close-fisted, and sharp at a bargain,--a person of few friends and many enemies. He was a great fighter, carrying a grudge to any length for the sheer pleasure of gratifying it. Therefore many a more mature and courageous promoter than Paul Cameron had shrunk from approaching him with a business proposition. Even Paul did not at all relish the mission before him; he was, however, too manly to shirk it. Hence that evening, directly after dinner, he made his way to the mansion of Mr. Arthur Presby Carter, the wealthy owner of the Echo, Burmingham's most widely circulated daily. Fortunately or unfortunately--Paul was uncertain which--the capitalist was at home and at leisure; and with beating heart the boy was ushered into the presence of this illustrious gentleman. Mr. Carter greeted him politely but with no cordiality. So you're Paul Cameron. 
I've had dealings with your father," he remarked dryly. "What can I do for you?" Paul's courage ebbed. The question was crisp and direct, demanding a reply of similar -> p#036 -- p2n In spite of Paul's optimism he was more than half of Melville's opinion. Mr. Carter was well known throughout Burmingham as a stern, austere man whom people feared rather than loved. He had the reputation of being shrewd, close-fisted, and sharp at a bargain,--a person of few friends and many enemies. He was a great fighter, carrying a grudge to any length for the sheer pleasure of gratifying it. Therefore many a more mature and courageous promoter than Paul Cameron had shrunk from approaching him with a business proposition. Even Paul did not at all relish the mission before him; he was, however, too manly to shirk it. Hence that evening, directly after dinner, he made his way to the mansion of Mr. Arthur Presby Carter, the wealthy owner of the Echo, Burmingham's most widely circulated daily. Fortunately or unfortunately--Paul was uncertain which--the capitalist was at home and at leisure; and with beating heart the boy was ushered into the presence of this illustrious gentleman. Mr. Carter greeted him politely but with no cordiality. So you're Paul Cameron. I've had dealings with your father," he remarked dryly. "What can I do for you?" Paul's courage ebbed. The question was crisp and direct, demanding a reply of similar -> p#036 -- p3n In spite of Paul's optimism he was more than half of Melville's opinion. Mr. Carter was well known throughout Burmingham as a stern, austere man whom people feared rather than loved. He had the reputation of being shrewd, close-fisted, and sharp at a bargain,--a person of few friends and many enemies. He was a great fighter, carrying a grudge to any length for the sheer pleasure of gratifying it. Therefore many a more mature and courageous promoter than Paul Cameron had shrunk from approaching him with a business proposition. 
Even Paul did not at all relish the mission before him; he was, however, too manly to shirk it. Hence that evening, directly after dinner, he made his way to the mansion of Mr. Arthur Presby Carter, the wealthy owner of the Echo, Burmingham's most widely circulated daily. Fortunately or unfortunately--Paul was uncertain which--the capitalist was at home and at leisure; and with beating heart the boy was ushered into the presence of this illustrious gentleman. Mr. Carter greeted him politely but with no cordiality. "So you're Paul Cameron. I've had dealings with your father," he remarked dryly. "What can I do for you?" Paul's courage ebbed. The question was crisp and direct, demanding a reply of similar ================================================ wondering what happened with the parallel p1 proofing on page 36? well, the parallel proofing didn't do _quite_ as good as the normal p1; they missed _3_ errors, compared to the normal p1 missing just _2_... the _good_ news, however, is that the second parallel proofing _caught_ the 2 errors missed by the _first_ parallel proofing, so -- taken together -- they achieved perfection. in 2 p1 rounds! as opposed to 3 normal rounds. and they had no useful workcheck, either, with "good" and "bad" word-lists. i tell you, those p1 proofers _rock_... for the record, here's the 3 errors missed by the p1p proofers: p1p> half of Melvil1e's opinion. p1p> carryng a grudge to any length for the sheer p1p> Birmingham's most widely circulated daily. oh, and just so you know, spellcheck would catch all 3 of those errors. neat. ================================================ ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/41170224/attachment.htm From grythumn at gmail.com Thu Mar 20 18:53:52 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Thu, 20 Mar 2008 21:53:52 -0400 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 In-Reply-To: <47E30EE3.2040208@netronome.com> References: <47E27023.1020208@netronome.com> <200803201117.29396.rolsch@verizon.net> <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> <15cfa2a50803201146k11e09a9aj1255f633d415d909@mail.gmail.com> <47E30EE3.2040208@netronome.com> Message-ID: <15cfa2a50803201853u2fe4eb9eh1aec8f7601ed3287@mail.gmail.com> On Thu, Mar 20, 2008 at 9:26 PM, La Monte H.P. Yarroll wrote: > Robert Cicconetti wrote: > > It's easy enough to build binaries (I use cygwin). I don't know if > > anyone has packaged them or built a gui around it. > > > > I do recommend turning down the defaults a bit.. I think it was tuned > > for processing scanned photocopies, and it is rather overaggressive on > > my scans. > > I have been limiting its use to very narrow cases because of how > aggressive it is. > > What settings do you find best for 8-bit grayscale documents scanned > from original books? I have not found settings I have been happy with. I don't really use it to clean scans from original books; Abbyy FR's adaptive thresholding works well for almost all of my text pages. (It sucks on illos, of course) I have used unpaper to split 2-up pages from original scans, and occasionally use it to clean up microfilm scans (~600 DPI b/w). Blackfilter set to about 0.98, intensity 10 (Depends on the book), plus deskewing and page splitting, is something I've used in the past for microfilm scans. I think I've used it on some other projects, but I can't find any notes on the settings. unpaper's deskewing, at least the version I'm using, is slow and definitely a memory hog. I generally turn off qpixel. 
R C -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/2cbeeb1e/attachment-0001.htm From rolsch at verizon.net Thu Mar 20 18:56:48 2008 From: rolsch at verizon.net (Roland Schlenker) Date: Thu, 20 Mar 2008 20:56:48 -0500 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 In-Reply-To: <47E30EE3.2040208@netronome.com> References: <15cfa2a50803201146k11e09a9aj1255f633d415d909@mail.gmail.com> <47E30EE3.2040208@netronome.com> Message-ID: <200803202156.49137.rolsch@verizon.net> On Thursday 20 March 2008 9:26:59 pm La Monte H.P. Yarroll wrote: > Robert Cicconetti wrote: > > I do recommend turning down the defaults a bit.. I think it was tuned > > for processing scanned photocopies, and it is rather overaggressive on > > my scans. > > I have been limiting its use to very narrow cases because of how > aggressive it is. > > What settings do you find best for 8-bit grayscale documents scanned > from original books? I have not found settings I have been happy with. > > > Bob For original books that I have scanned myself, I input the scans directly into FineReader. However, for the scan-sets from Early Canadiana Online that I have recently OCR'ed for DP-C, the scans were of very, very poor quality. I used aggressive unpaper options, "-bt 0.85 -ni 8 -li 0.03", which at times removed areas of text. Those scans, missing areas of text, were reprocessed without using unpaper by simply removing the unnecessary edge areas.
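For reference, those options assemble into a command line along these lines (a sketch only; the helper name is mine, and the flag values are just the ones quoted above -- check `unpaper --help` before reusing them):

```python
def unpaper_cmd(src, dst, opts=("-bt", "0.85", "-ni", "8", "-li", "0.03")):
    """Assemble an aggressive unpaper invocation like the one described.

    The option values are the ones quoted above; tune them per book,
    since settings this aggressive can eat areas of text.
    """
    return ["unpaper", *opts, src, dst]

# run it with: subprocess.run(unpaper_cmd("in.pgm", "out.pgm"), check=True)
```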
Roland Schlenker From donovan at abs.net Thu Mar 20 20:49:14 2008 From: donovan at abs.net (D Garcia) Date: Thu, 20 Mar 2008 23:49:14 -0400 Subject: [gutvol-d] Moderation/censorship In-Reply-To: <20080321012513.GB22705@mail.pglaf.org> References: <20080321012513.GB22705@mail.pglaf.org> Message-ID: <200803202349.15039.donovan@abs.net> On Thursday 20 March 2008 21:25, Greg Newby wrote: > I received a request to moderate or otherwise quiet > Bowerbird on gutvol-d. This request was based on an > opinion that he has been behaving poorly. > The answer is: no, I will not turn on moderation or > remove list members at this time. In contrast, over on DP where bowerbird created several sockpuppet accounts to bypass his unprecented ban there, he recently used one to intentionally sabotage the experiment in continuous proofing which he has been making such noise about here. That account and the others he has been using have also been disabled according to Juliet Sutherland's instruction to maintain the ban, especially given the escalation of his behavior there from trolling to actual sabotage. (Or if you prefer, his stooping to that behavior, since online, he's a bird.) David From marcello at perathoner.de Fri Mar 21 00:11:48 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri, 21 Mar 2008 08:11:48 +0100 Subject: [gutvol-d] Moderation/censorship In-Reply-To: <200803202349.15039.donovan@abs.net> References: <20080321012513.GB22705@mail.pglaf.org> <200803202349.15039.donovan@abs.net> Message-ID: <47E35FB4.2070901@perathoner.de> D Garcia wrote: > In contrast, over on DP where bowerbird created several sockpuppet accounts to > bypass his unprecented ban there, he recently used one to intentionally > sabotage the experiment in continuous proofing which he has been making such > noise about here. That clearly shows the level of confidence he has in his own theories. He had to go and skew the stats. Tzzzz. 
-- Marcello Perathoner webmaster at gutenberg.org From ralf at ark.in-berlin.de Fri Mar 21 01:32:05 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Fri, 21 Mar 2008 09:32:05 +0100 Subject: [gutvol-d] tesseract and ligatures Message-ID: <20080321083205.GC18003@ark.in-berlin.de> BTW, since we're just at it, there's another nontriviality involved with some scans that are out there. Some scans you can get have sort of shadows, just like antialiasing which is practically impossible to remove. This leads to characters sticking together like ligatures do. I thought OK why not, then let's just train tesseract for those character groups, it will only take a bit more effort... Result was, tesseract is not able to train ligatures, i.e. groups of characters, at all! It's hard wired to single characters. It *appears* at first to be able to train pairs of characters but this is an illusion because if it's a pair of ASCII chars tesseract won't barf because it thinks it's one UTF-8 char. I wonder if they even thought of multibyte UTF-8? Summary for me: Tesseract is unusable without ligature support. This is a major bug. Regards, ralf From Bowerbird at aol.com Fri Mar 21 02:13:57 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 21 Mar 2008 05:13:57 EDT Subject: [gutvol-d] Moderation/censorship Message-ID: donovan/david said: > In contrast, over on DP where bowerbird > created several sockpuppet accounts to > bypass his unprecented ban there, um... he probably meant "unprecedented"... but... sockpuppet accounts? to bypass a ban? excuse me? i was explicitly _not_ prohibited from _proofing_ when i was "banned" -- i was only restricted from _posting_ in the forums... go back and look it up, if you must... > he recently used one to intentionally > sabotage > the experiment in continuous proofing untrue. and a low blow to boot. i didn't "sabotage" the experiment. i was doing the one thing i was still _allowed_ to do at distributed proofreaders, i.e., proof... 
and i did a darn good job on every page i did. look at my "diffs", and you'll see that i spotted and corrected the errors on pages i proofed... and then look at my "no diff" pages, and you'll see i _correctly_ passed through correct pages. show me one error that i failed to catch. show me one "error" which i "injected"... no sir, as far as i know, and i would _love_it_ if someone pointed out a mistake i had made, because i _learn_ from my _mistakes_, i _do_, but as far as i know, i made _no_ mistakes on the 128+ pages which i proofed... not a one... and to imply otherwise is to tell one big fat lie. > That account and the others he has been using > have also been disabled according to Juliet Sutherland's > instruction to maintain the ban, especially given the > escalation of his behavior there from trolling > to actual sabotage. there was no "trolling". and there's been no "sabotage". these loaded words are because donovan/david simply cannot tolerate the truth, _especially_ when it is backed with data and data and more data and more data still... -bowerbird p.s. but i knew he'd shut me out when i revealed my data, because that's what power freaks do when you reveal them, they exercise their power. of course, if i'd still cared one bit whether i can access d.p. or not, i wouldn't have cut the cord. i knew exactly what i was doing, and exactly what he would do. power freaks have buttons that are so easy to predict, it's pitiful. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080321/c6411936/attachment.htm From piggy at netronome.com Fri Mar 21 04:46:27 2008 From: piggy at netronome.com (La Monte H.P.
Yarroll) Date: Fri, 21 Mar 2008 07:46:27 -0400 Subject: [gutvol-d] tesseract and ligatures In-Reply-To: <20080321083205.GC18003@ark.in-berlin.de> References: <20080321083205.GC18003@ark.in-berlin.de> Message-ID: <47E3A013.2050009@netronome.com> Ralf Stephan wrote: > BTW, since we're just at it, there's another nontriviality > involved with some scans that are out there. > > Some scans you can get have sort of shadows, just like > antialiasing which is practically impossible to remove. > This leads to characters sticking together like ligatures do. > > I thought OK why not, then let's just train tesseract for > those character groups, it will only take a bit more effort... > Result was, tesseract is not able to train ligatures, i.e. > groups of characters, at all! It's hard wired to single characters. > It *appears* at first to be able to train pairs of characters > but this is an illusion because if it's a pair of ASCII chars > tesseract won't barf because it thinks it's one UTF-8 char. > > I wonder if they even thought of multibyte UTF-8? > > Summary for me: Tesseract is unusable without ligature support. > This is a major bug. > > Uh, it handles ligatures just fine. I couldn't do Fraktur without it. It uses UTF-8 internally. For some typefaces I've done exactly what you describe--I've added "ligatures" which are just common printing defects. There's a fellow working on Kannada, and there every glyph is a ligature. If you send me your training pages and the training data you generated, I'd be happy to look through them. > Regards, > ralf > From Bowerbird at aol.com Fri Mar 21 11:25:11 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 21 Mar 2008 14:25:11 EDT Subject: [gutvol-d] happy spring everyone Message-ID: information at the website of the griffith observatory tells me it has been spring ever since wednesday night, so i'm glad i am officially caught up with the universe...
it also tells me the moon was full last night, which just might help explain the little bit of lunacy that happened. as if we needed an explanation... you might remember that, earlier in the week, i told you that i intended to start taking my posts to a public blog, one that -- unlike this list -- will be crawled by google, and so gain a greater visibility i have eschewed thus far. i said i would start it on _friday_, and lo and behold, on _thursday_night_ donovan and juliet invent an excuse to "justify" a new effort to stop me from _visiting_ d.p. i certainly understand them wanting to cut me off... as long as i was just _talking_ about d.p. inefficiencies, they could try to get you to dismiss me as an old crank. but once i start _quantifying_ those inefficiencies -- 1,137 em-dashes mistakenly turned into en-dashes which then had to be corrected manually by proofers, 504 end-line hyphenates which had to be rejoined _needlessly_, 57 end-line em-dashes to be clothed -- and doing it with results from their very own research, which i can point people to view right on the d.p. site, well then that solid evidence proves i ain't just a crank. as long as i'm abstractly _talking_ about incompetence, it's one thing. but when i _show_ you their actual o.c.r.: > http://z-m-l.com/misc/paul_bad_pages.html so you can see _exactly_ how embarrassingly awful it is, and understand how wrong it is to make proofers fix that, well, it's kind of hard to sweep those facts under the rug... so i certainly understand them wanting to cut me off... they don't want you to see the truth, to see the actual data. so if i'm gonna _reveal_ it, their only option is to cut me off. *** but you nice folks on this list don't need more help from me. you now know how to view d.p. incompetence on your own... 
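(for the curious: rejoining end-line hyphenates, one of the clean-up steps counted above, is mechanical. a rough sketch -- my own code, not d.p.'s -- which leaves "--" em-dashes alone and, being naive, would also rejoin legitimately hyphenated compounds, which is why a word-list or a human still has to look:)

```python
def rejoin_hyphenates(lines):
    """Rejoin words split across line breaks with a trailing hyphen.

    Sketch only: a trailing "--" (an em-dash in this convention) is
    left alone, and every single trailing hyphen is joined, so true
    compounds split at their own hyphen still need review.
    """
    lines = list(lines)
    out = []
    for i, line in enumerate(lines):
        if line.endswith("-") and not line.endswith("--") and i + 1 < len(lines):
            first, _, rest = lines[i + 1].partition(" ")
            line = line[:-1] + first   # pull the word's tail up
            lines[i + 1] = rest        # leave the remainder below
        out.append(line)
    return out
```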
just go to the project page of a few books they're doing -- pick a handful of books, at random, to get a sample -- and step through the progression of pages like i did for file #36 of "paul and the printing press", and you'll see it for yourself. in roughly half of the d.p. books, there is an initial incompetence which is huge, then a heroic p1 job, followed by p2 catching most of the rest, and finally p3 coming in to "finish up" the job, at least more or less... in the other half of the books, there were good scans and good o.c.r., and then p1 has a _much_ easier time of it... still, the curve you obtain when you plot the errors fixed per round is uniformly down-sloping. the only difference is how high it starts at. with "paul and the printing press", there were typically an estimated 50-100 errors per page, with p1 fixing 90%-98%, then p2 getting a couple of them, and p3 doing the stragglers. (and probably missing a few.) with the "cleaner" books, p1 will catch 2-10 errors per page, p2 will catch the remaining 1 or 2, and p3 has nothing left... page after page like this, in book after book, day after day... *** oh yeah, since we're starting off a new season and all of that, it's probably a good time to make an important observation. it should be clear that i think very highly of the p1 proofers... (and it should be very clear now that they deserve our praise, rather than the disdain that they sometimes get over at d.p.) what might not be so clear is how i feel about the p3 proofers. the data i have shown hasn't been very kind to their reputation, and i've pointed out time after time their performance has been not significantly different than the p1 proofers. but let me assure you that i think _very_ highly of the p3 people. first of all, they're volunteers just like everyone else. moreover, they are the volunteers who have stepped up to the plate and said "i will be one of the people who constitutes the final line". 
this means they're taking responsibility for attaining perfection; if there are o.c.r. errors left in a text after they are done with it, they are the ones who will take the blame. that is _admirable_... plus, they get the toughest errors, the ones that have survived two sets of human proofer eyeballs already. the sneaky ones. all by itself, this combination of _increased_responsibility_ and _finding_the_persistent_errors_ is a difficult-enough burden... but furthermore, due to the wacky workflow d.p. has invented, where 100 p3 proofers have to do the same number of pages as _thousands_ of p1 proofers, and do 'em to a higher standard, the p3 proofers are now _exhausted_. they're badly burned out, and they're tired, and that does not make for an efficient proofer. so if the p3 proofers are letting a few errors slip by these days, let me assure them loudly and clearly that "i understand why!" they're _overworked_, and underpaid, and they need a break... a spring break... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080321/6be38bd7/attachment-0001.htm From richfield at telkomsa.net Fri Mar 21 06:48:12 2008 From: richfield at telkomsa.net (Jon Richfield) Date: Fri, 21 Mar 2008 15:48:12 +0200 Subject: [gutvol-d] Gothic or Gothic? attn: La Monte H.P. Yarroll Message-ID: <47E3BC9C.80006@telkomsa.net> Why, sure. Let me know in what form. I'd be happy to email you a few Omniscanned page images, or if you prefer, digitally photographed JPGs. Let me know whether you have any strong preferences; eg do you just want a sample page, or one page from each letter of the alphabet, or what? I also have a Cassel's German Dictionary from the 1950's. (G-E, E-G) The German text is in Fraktur. 
Do you want a couple of pages of that as well? One thing though; neither book is sacrificable, so you will have to take pot luck with page gutters etc. Unfortunately though, I cannot get down to that till first week in April. I hope that isn't a deal-breaker. Cheers, Jon >>> Subject: Re: [gutvol-d] Gothic or Gothic? Thanks folks! From: "La Monte H.P. Yarroll" Date: Thu, 20 Mar 2008 10:14:06 -0400 To: Project Gutenberg Volunteer Discussion Jon Richfield wrote: > I wrote asking for advice on scanning Fraktur. (Absent-mindedly > claiming to have Omniscan, which afaik is the one universal and > error-free scanning software package; I meant of course Omnipage, > which is not. (Not bad actually, but...)) > ..... I'm actively interested in improving tesseract OCR's fraktur performance. Could you point me at a few of your pages? I'll let you know how well we're doing with the limited fraktur training we have so far. <<< From hyphen at hyphenologist.co.uk Fri Mar 21 11:31:46 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Fri, 21 Mar 2008 18:31:46 -0000 Subject: [gutvol-d] happy spring everyone In-Reply-To: References: Message-ID: <000601c88b81$d2af6050$780e20f0$@co.uk> Bowerbird at aol.com wrote >information at the website of the griffith observatory >tells me it has been spring ever since wednesday night, Complete with snow in the northern of England where I am Brrrrrrrrrrrrrrrrrrrr. Dave F -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080321/b14788ba/attachment.htm From Bowerbird at aol.com Fri Mar 21 11:49:33 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 21 Mar 2008 14:49:33 EDT Subject: [gutvol-d] happy spring everyone Message-ID: dave said: > Complete with snow in the northern of England california is the place you want to be... ;+) -bowerbird
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080321/81dcc85e/attachment.htm From Bowerbird at aol.com Fri Mar 21 12:42:58 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 21 Mar 2008 15:42:58 EDT Subject: [gutvol-d] the google book a.p.i. Message-ID: anybody using the google book a.p.i. yet? > http://booksearch.blogspot.com/2008/03/preview-books-anywhere-with-new-google.html found a way to connect p.g. e-texts to it? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080321/e27cce46/attachment.htm From j.hagerson at comcast.net Fri Mar 21 12:51:37 2008 From: j.hagerson at comcast.net (John Hagerson) Date: Fri, 21 Mar 2008 14:51:37 -0500 Subject: [gutvol-d] PG DVD project needs volunteers Message-ID: <023701c88b8c$f9d3d1b0$1f12fea9@sarek> If you would like to help us duplicate and mail out copies of the Project Gutenberg DVD, we could use your assistance. Please contact me off the list. Thank you. John Hagerson From piggy at netronome.com Fri Mar 21 15:21:12 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Fri, 21 Mar 2008 18:21:12 -0400 Subject: [gutvol-d] Gothic or Gothic? attn: La Monte H.P. Yarroll In-Reply-To: <47E3BC9C.80006@telkomsa.net> References: <47E3BC9C.80006@telkomsa.net> Message-ID: <47E434D8.3010901@netronome.com> Jon Richfield wrote: > Why, sure. Let me know in what form. I'd be happy to email you a few > Omniscanned page images, or if you prefer, digitally photographed > JPGs.
Let me know whether you have any strong preferences; eg do you > just want a sample page, or one page from each letter of the alphabet, > or what? > I would recommend avoiding JPG for anything that has to go through OCR. The lossy characteristics of JPG tend to reduce the effectiveness of OCR a lot. PNG and TIFF are my preferred formats, but any open format will do. A single sample page would be a good start. If the current training does not work well, I'll ask for more. > I also have a Cassel's German Dictionary from the 1950's. (G-E, E-G) The > German text is in Fraktur. Do you want a couple of pages of that as well? > I don't think I can clear that. Thanks for the offer though. > One thing though; neither book is sacrificable, so you will have to take > pot luck with page gutters etc. > Ah, yes. Greyscale scans are much preferred over bilevel, especially if there is going to be gutter noise. > Unfortunately though, I cannot get down to that till first week in > April. I hope that isn't a deal-breaker. > I certainly hope to live much more than another two weeks. :=) > Cheers, > > Jon > > >>> > Subject: > Re: [gutvol-d] Gothic or Gothic? Thanks folks! > From: > "La Monte H.P. Yarroll" > Date: > Thu, 20 Mar 2008 10:14:06 -0400 > > To: > Project Gutenberg Volunteer Discussion > > > Jon Richfield wrote: > >> I wrote asking for advice on scanning Fraktur. (Absent-mindedly >> claiming to have Omniscan, which afaik is the one universal and >> error-free scanning software package; I meant of course Omnipage, >> which is not. (Not bad actually, but...)) >> >> > ..... > > I'm actively interested in improving tesseract OCR's fraktur > performance. Could you point me at a few of your pages? I'll let you > know how well we're doing with the limited fraktur training we have so far. 
> > <<< > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > From donovan at abs.net Fri Mar 21 16:52:00 2008 From: donovan at abs.net (D Garcia) Date: Fri, 21 Mar 2008 19:52:00 -0400 Subject: [gutvol-d] Moderation/censorship In-Reply-To: References: Message-ID: <200803211952.00866.donovan@abs.net> On Friday 21 March 2008 05:13, Bowerbird at aol.com wrote: > donovan/david said: > > In contrast, over on DP where bowerbird > > created several sockpuppet accounts to > > bypass his unprecented ban there, > > um... he probably meant "unprecedented"... In fact I do and did. Cat fur in the keyboard, probably. > but... sockpuppet accounts? to bypass a ban? Yes, and my apologies to the DP user whose username actually *is* 'sockpuppet.' Evidence follows, but for the impatient, skip to the last quoted portion and response. > excuse me? i was explicitly _not_ prohibited > from _proofing_ when i was "banned" -- i was > only restricted from _posting_ in the forums... Since bowerbird mentions it, let's review the sum total of his known proofreading activities at DP. It's quite an enlightening view, and very relevant to the discussion. As bowerbird, 32 pages back in the years when DP had only two rounds. As bradjohnson, 3 pages, account not used in 251 days. As haroldjohnson, 4 pages, most recently a single page on March 7, 2008. As ellipsisshellipis, (interesting nick choice), 16 pages on March 7, 2008 (the date the account was created), and the 116 pages of "work" in the experiment project on March 19, 2008. This account was also used to post a poll on the DP forums. (See above where bb clearly states his belief was that he was explicitly banned from posting in the forums.) As sandy claws, no pages, but a Christmas Day 2007 posting (the day the account was created.)
(Again, see above where bb clearly states his belief was that he was
explicitly banned from posting in the forums.)

Patterns, anyone? Out of all the projects available to choose from during
all that time, bowerbird only managed to find *one* that piqued his
interest, and it just so happened to be the one he's been ever so
faithfully posting about here, in much less than flattering terms.
Obviously he understood that he was banned from posting in the DP forums,
and yet he used two freshly-minted accounts to do exactly that.

> > he recently used one to intentionally
> > sabotage
> > the experiment in continuous proofing
>
> untrue. and a low blow to boot.

See above.

> i didn't "sabotage" the experiment.

The people actually running the experiment at DP say differently, used far
stronger language in describing his efforts in that project, and are to me
far more credible as references.

> i was doing the one thing i was still _allowed_
> to do at distributed proofreaders, i.e., proof...

See above for evidence regarding bowerbird's obvious commitment to DP.

> and i did a darn good job on every page i did.

Many of our volunteers with bowerbird's level of experience with DP also
believe the above statement to be true of themselves.

> no sir, as far as i know, and i would _love_it_
> if someone pointed out a mistake i had made,
> because i _learn_ from my _mistakes_, i _do_,
> but as far as i know, i made _no_ mistakes on
> the 128+ pages which i proofed... not a one...

Perhaps bowerbird has chosen to learn from the wrong mistakes.

Let's skip on a bit...

> there was no "trolling". and there's been no "sabotage".
>
> these loaded words are because donovan/davie simply
> cannot tolerate the truth, _especially_ when it is backed
> with data and data and more data and more data still...

I don't believe I've ever seen a more clear-cut example of projection.

> p.s.
but i knew he'd shut me out when i revealed my data, > because that's what power freaks do when you reveal them, > they exercise their power. of course, if i'd still cared one bit > whether i can access d.p. or not, i wouldn't have cut the cord. Except that you *didn't* cut the cord. Instead, you explicitly circumvented the letter *and* the spirit of your ban at DP, got caught (calamity of calamities!) and DP slapped your hands for it. Forgive me if I'm entirely unsympathetic, but you set this up yourself. The admins at DP have long been aware that you have other accounts, and as long as you only used them to read the forums, you received the benefit of the doubt. *Only* when that condition changed was the ban extended to the other accounts, *after* discussion and agreement by the admins. Your only justification in characterizing me as a "power freak" is that I delivered the message. At any rate, we've established that bowerbird doesn't care about DP anymore, and that's great news to a lot of people. I hope this means no more looking for new accounts that he has created, and that the PG volunteers on this list will no longer have to skip past his previously uninterrupted flow of infrequently relevant but always copiously quixotic posts. Have a great Easter! David From Bowerbird at aol.com Sat Mar 22 00:07:09 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 22 Mar 2008 03:07:09 EDT Subject: [gutvol-d] Moderation/censorship Message-ID: i'll have some nice long replies to donovan/david next week... :+) -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080322/701aaf2f/attachment.htm From paulmaas at airpost.net Sat Mar 22 07:02:06 2008 From: paulmaas at airpost.net (Paul Maas) Date: Sat, 22 Mar 2008 07:02:06 -0700 Subject: [gutvol-d] Moderation/censorship In-Reply-To: References: Message-ID: <1206194526.21133.1243751209@webmail.messagingengine.com> Mr. Bowerbird, to spare us your bloated email replies, why don't you instead post them in a more generic form to your esteemed blog? That way they can be Google indexed. Fair trade. What is the URL to your blog? On Sat, 22 Mar 2008 03:07:09 EDT, Bowerbird at aol.com said: > > i'll have some nice long replies to donovan/david next week... > :+) > > -bowerbird > > > > ************** > Create a Home Theater Like the Pros. Watch the video on AOL > Home. > > (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -- Paul Maas paulmaas at airpost.net -- http://www.fastmail.fm - Access your email from home and the web From Bowerbird at aol.com Sat Mar 22 11:15:09 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 22 Mar 2008 14:15:09 EDT Subject: [gutvol-d] bloated email replies Message-ID: paul said: > to spare us your bloated email replies well, it appears that paul doesn't mind hearing an unfair insult, but balks when then asked to listen to the person clearing their name... learn to use your delete key, paul. because i respond to attacks... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080322/53360a9a/attachment.htm From Bowerbird at aol.com Sat Mar 22 11:30:46 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 22 Mar 2008 14:30:46 EDT Subject: [gutvol-d] a lot of incompetence Message-ID: folks, there's a lot of incompetence over at d.p. a lot. from a lot of people. and now that i'm pointing to it so everyone can see it, rather than just making "vague" claims that it's there, it's making those incompetent people _very_ nervous. they're used to dumping their crap on the proofers -- unfairly -- and getting it back all nice and shiny. now i'm serving notice that i'm going to reveal them... so they're gonna turn up their attack machines. but i can handle their flak. i still buy my anti-flame foam -- the same kind they use on airport runways -- by the tanker-truckload... as much as possible, i'm going to stick to the _data_. but the idiots are going to try to make it _personal_... so if you don't like turbulence, buckle your seatbelts. -bowerbird p.s. and yes, i will also post the data to a blog, and prevent the idiots from commenting there, so if you _really_ want peace and quiet, just read it there... but if they wanna fight here, i _will_ fight 'em here. ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080322/c247ff7a/attachment.htm From Bowerbird at aol.com Sat Mar 22 11:41:32 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 22 Mar 2008 14:41:32 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 06 Message-ID: speaking of data... more info on the parallel test of "paul and the printing press"... once again, looking for an easy-to-compute "metric" of quality. 
yesterday we noted the percentage of lines changed on a page to give us a "metric" about the quality of the page before and after... today we get the percentage of _pages_ changed in a _round_ to give us a "metric" we can use to determine quality of the round... (and you're right, such a "round metric" is of absolutely _no_use_ in the promulgation of a "roundless" system, but piggy wants one anyway, so let's try to give piggy what he wants, make him happy.) again, you can follow along by looking at the actual data at d.p.: > http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea this parallel proofing was the one in the normal p1-p2-p3 workflow. we're gonna focus right now on the changes made in the p1 round... specifically, we're gonna count the number of pages changed by p1. out of 244 pages, the only "no diff" ones were the 15 blank pages, with 1 exception (#140), which was "no diff" because the proofer missed the 2 errors on it. (he also "forgot" to rejoin hyphenates, _and_ to place blank lines between paragraphs, which indicates to me that this proofer just plain neglected to proof this page.) anyway, for the record, here are the two hard-core errors: > It was this " echoing idea" that was new to <- floating quotemark > of a monologue,-an exact reversal of his policy. <-s/b em-dash oh, and by the way, the second parallel proofing caught the latter of those two errors, but it missed the former... but as you know, floating quotemarks are autodetectable. ok, so out of 229 non-blank pages, _228_ were changed by p1... that's a very high percentage, reflecting how rotten the o.c.r. was. when you have good scans and good o.c.r., 25% to 75% of the pages in a book can be recognized perfectly by the o.c.r. app, especially when it is supplemented by a good clean-up tool... *** while we're here, let's dig a little bit deeper into this data, ok? especially in a way that will give us a _page-quality_ metric. (because remember, that's what this mission was all about.) 
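the "autodetectable" claim about that floating quotemark can be made
concrete. a tiny illustrative check in python (the regex here is only a
sketch of the idea, not an actual d.p. preprocessing rule):

```python
# Flag "floating" quotemarks: a quote character with whitespace on
# both sides, as in the error quoted above. This regex is only an
# illustrative sketch, not an actual DP preprocessing rule.
import re

FLOATING_QUOTE = re.compile(r'\s["\']\s')

def has_floating_quote(line):
    """Return True if the line contains a quote set off by spaces."""
    return bool(FLOATING_QUOTE.search(line))

flagged = has_floating_quote('It was this " echoing idea" that was new to')
clean = has_floating_quote('He said "hello" and left.')
```

the first line gets flagged, the second doesn't -- which is exactly the
kind of mechanical check that could run before pages ever reach a proofer.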
first, toss the 15 blank pages. not hard to proof them.

of the remaining 229 pages, what we have left is this...

> 73 pages with a p2 "no diff" and a p3 "no diff" following it.
> 118 pages where p2 had a "diff", and p3 "no diff" after that.
> 24 pages where p2 had a "diff" and p3 had a "diff" as well.
> 14 pages where p2 had "no diff", but p3 _did_ have a "diff".

ok, so let's do a closer analysis of these one-by-one...

73 pages with a p2 "no diff" and a p3 "no diff" following it.
these are the pages which p1 proofers took to perfection,
often after having made _many_ changes to inferior o.c.r.
considering that some of these pages required _dozens_ of
type-in corrections, this 32% perfection rate is _great_.

118 pages where p2 had a "diff", and p3 "no diff" after that.
these are the 52% of the pages which p2 took to perfection,
usually by catching the occasional errors p1 had missed...
note that after 2 rounds, 84% of the pages were _perfect_.

24 pages where p2 had a "diff" and p3 had a "diff" as well.
these 10% of the pages are ones we presume p3 perfected,
p2 fixed _some_ errors on these pages, but p3 got the rest.
so these pages took the combined efforts of p1, p2 and p3.

14 pages where p2 had "no diff", but p3 _did_ have a "diff".
on these 6% of the pages, p2 was asleep, but p3 covered;
(but, in fairness to p2, half the changes were "ticky-tack".)

***

so once again, we get the pattern i've discussed all along,
the pattern that seems to capture a "common-sense" take,
which is that p1 fixes most of the errors, p2 gets most of
the remaining ones, and p3 comes in and does clean-up.

sure enough, p1 did _awesome_, converting _rotten_ o.c.r.
into great pages, including an amazing 32% perfection rate.

p2 did well, taking another 52% of the pages to perfection
all by themselves, and 10% more in conjunction with p3...

p3 had to clean up the final 6% of the pages -- just 14 --
and on half of those, the changes they made were minor.
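that breakdown's percentages can be recomputed mechanically from
per-page diff flags. a small python sketch (the data layout here is
hypothetical -- it's not how d.p. actually stores its round data):

```python
# Recompute the round-quality percentages above from per-page diff
# flags. The (p2_diff, p3_diff) tuple layout is hypothetical -- it
# is not how DP actually stores its round records.

def round_breakdown(pages):
    """pages: list of (p2_made_change, p3_made_change) per non-blank page."""
    n = len(pages)
    counts = {
        "perfect_after_p1": sum(1 for p2, p3 in pages if not p2 and not p3),
        "perfect_after_p2": sum(1 for p2, p3 in pages if p2 and not p3),
        "p3_needed_too":    sum(1 for p2, p3 in pages if p2 and p3),
        "p2_missed_some":   sum(1 for p2, p3 in pages if not p2 and p3),
    }
    # return (count, percent-of-all-pages) per category
    return {k: (v, round(100.0 * v / n)) for k, v in counts.items()}

# the four groups reported above: 73, 118, 24, and 14 of 229 pages
pages = ([(False, False)] * 73 + [(True, False)] * 118 +
         [(True, True)] * 24 + [(False, True)] * 14)
breakdown = round_breakdown(pages)
```

the (count, percent) pairs come out as (73, 32), (118, 52), (24, 10),
and (14, 6) -- matching the figures quoted above.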
again, this is the pattern you get on page after page, in book after book, day after day, over in d.p.-land... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080322/3dccd85f/attachment-0001.htm From hart at pglaf.org Sun Mar 23 13:08:56 2008 From: hart at pglaf.org (Michael Hart) Date: Sun, 23 Mar 2008 13:08:56 -0700 (PDT) Subject: [gutvol-d] Unexpected Events Message-ID: I'm doing a survey on events of the last 5-10 years that you did NOT expect. In that perspective, I would also be interested in hearing your predictions for the next 5-10 years of events that YOU think might happen that would NOT be expected by the general population. Thanks!!! Michael S. Hart Founder Project Gutenberg From ricardofdiogo at gmail.com Sun Mar 23 13:53:30 2008 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Sun, 23 Mar 2008 20:53:30 +0000 Subject: [gutvol-d] Unexpected Events In-Reply-To: References: Message-ID: <9c6138c50803231353y3da0e7f3g3059cee1cc14d32b@mail.gmail.com> 2008/3/23, Michael Hart : > > > In that perspective, I would also be interested in > hearing your predictions for the next 5-10 years of > events that YOU think might happen that would NOT > be expected by the general population. 
>

In the next 5-10 years:

* most people in the western world will have an ebook reading device;
* they'll try to create an international cyberpolice for tracking down
everything you download;
* copyright is going to change drastically: a foundation will be created
for managing and paying royalties to the authors, ebooks will be
populated with ads, more and more content will be available for free;

Ricardo

From Bowerbird at aol.com Mon Mar 24 10:15:28 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 24 Mar 2008 13:15:28 EDT
Subject: [gutvol-d] the myth of the elusive error only the expert proofer can catch
Message-ID:

some people have this notion that there are "elusive errors"
that "only an expert proofer" can spot. this belief is a myth.

some people are certainly _better_ proofers than other folks.
a few individuals might even be relatively good enough that
we could reasonably consider 'em to be "experts" at proofing.
(but they seem to have been _born_ with the skill, rather than
having "learned" it, though experience does make it sharper.)

however, the flip side -- the error which is so elusive that _only_
the "expert proofer" can find it -- has no evidence that i can see.

some people have even asserted that _ten_rounds_ of "novice"
proofers might miss one of these "elusive errors", which would
then be spotted in a _single_ pass by _one_ expert proofer...

bull crap. at least, _i_ have never seen that happen.

and, try as hard as i might, i can't even _imagine_ what such an
"elusive error" might be, what it would look like, how it can hide.

and i'm one of the (unlucky?) people who can't not notice typos,
because they jump out and stick me in the eyeball with a pencil.
so i am quite sure that i'm not "blind" to these "elusive" errors...

oh, don't get me wrong. i've seen _plenty_ of errors that have
managed to escape detection from one, two, three, even _four_
rounds of proofing.
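errors slipping past several rounds is roughly what independent
per-round catch rates would predict. a back-of-envelope sketch (the
80% catch rate below is hypothetical, picked only for illustration):

```python
# Back-of-envelope: if each proofing round independently catches a
# fraction `catch_rate` of the remaining errors, a given error
# survives n rounds with probability (1 - catch_rate) ** n.
# The 80% figure used below is hypothetical, for illustration only.

def survival_probability(catch_rate, rounds):
    """Chance one error slips past every one of `rounds` passes."""
    return (1.0 - catch_rate) ** rounds

# with a hypothetical 80% per-round catch rate, about 0.8% of
# errors survive three rounds -- rare, but a matter of chance,
# not of errors being intrinsically "elusive"
p_survive_three = survival_probability(0.80, 3)
```

which is the coin-flip point: a plainly visible comma can survive five
rounds for the same reason ten flips sometimes all come up tails.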
heck, in the "planet strappers" experiment, there was an error that went unfixed by _five_ proofing rounds. what was that error? it was a _comma_, smack dab in the middle of a sentence, big as day, there for anyone and everyone to see... anyone reading the page would know that comma didn't belong. it hardly qualified as something that people would call "elusive"... then why was it missed? for the very same reason that sometimes you will get 10 coin-flips in a row all coming up "tails" -- _chance_. another error that survived for many rounds was one that simple _spellcheck_ would detect. how did it last so long? don't ask me. no, there is _no_ error that is so "elusive" that 100% of "novices" will miss it and 100% of the "expert proofers" will locate it. none. and if you want to maintain that there is, let's put it to the test. of the _self-selected_ lot who proof at distributed proofreaders, and thus get lots of experience, i'd say the "best" proofers catch about 80-95% of the errors, and the "ordinary" ones get 65-85%. so we're basically seeing many shades of gray, not black-or-white. -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080324/6235827f/attachment.htm From ralf at ark.in-berlin.de Sat Mar 22 01:19:41 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Sat, 22 Mar 2008 09:19:41 +0100 Subject: [gutvol-d] tesseract and ligatures In-Reply-To: <20080321083205.GC18003@ark.in-berlin.de> References: <20080321083205.GC18003@ark.in-berlin.de> Message-ID: <20080322081941.GA5299@ark.in-berlin.de> me wrote > Summary for me: Tesseract is unusable without ligature support. > This is a major bug. This applies to the SVN version (154, head) only, as I just found out, so hands off that. 
The official version 2.01 appears to do better, but I'm still testing.

Sorry for shouting first,
ralf

From Bowerbird at aol.com Mon Mar 24 14:36:55 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 24 Mar 2008 17:36:55 EDT
Subject: [gutvol-d] parallel -- paul and the printing press -- 07
Message-ID:

here's elaboration on the data i presented on saturday,
again on the parallel test of "paul and the printing press".

this view is the best look at this data-set yet...

and it's derived using info that's presently available to d.p.
right on its own "project page" for this book:
> http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea

(which means that any book in their system could be subjected
to this same analysis, at any time you want.)

***

but first, correction of a minor error i made:

in discussion of a metric for the rotten o.c.r.,
i checked the number of o.c.r. pages changed.

i said the normal p1-p2-p3 workflow had had just 1 page where
p1 had a "no diff" from o.c.r., and even then there were 2 errors
on that page, and that the parallel p1 had caught one of them.

in actuality, it was the _parallel_ p1 who had a "no diff" on that
page, as the _normal_ p1 had found and fixed _both_ of those errors.
normal p1 had a "diff" on _all_ non-blank pages.

again, you have some pretty awful o.c.r. when it can't
get even _1_ page perfect out of 200+.

***

ok, now on to a closer view of the data i presented saturday...

in this view, i show the different types of progression through
the rounds, starting with those that would benefit most from
another round of full proofing, or a changes-only verification.

i list the actual pages that fall into each type of "progression"...

by "progression", i mean separation of each page into "types",
where the type reflects the rounds making a change to a page.
a pair of asterisks indicates no change was made in that round,
so -- for instance -- the progression-type of p1-**-p3 means
that p1 changed the page, p2 had a "no diff", and p3 changed it.

the progression-types i found were:
-> p1-p2-p3 -- 22 pages -- (every round made a change)
-> p1-**-p3 -- 14 pages -- (p1 and p3 made a change)
-> **-p2-p3 -- 1 page -- (p2 and p3 made a change)
-> p1-p2-** -- 116 pages -- (p2 made the last change to the page)
-> p1-**-** -- 76 pages -- (p1 made the last change to the page)
-> **-**-** -- 15 pages -- (all these no-change pages were _blank_)

the pages that comprise each type of progression are listed below...

again, you can follow along if you like, by viewing the project page:
> http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea

***

these pages could benefit from another round...

-> progression-type p1-p2-p3
(all 3 rounds made changes -- 22 pages -- could use another _full_
proofing round)
8, 18, 22, 23, 36, 38, 65, 81, 90, 96, 111, 144, 154, 155, 168,
196, 205, 209, 212, 225, 227, 237

-> progression-type p1-**-p3
(p2 "no diff", but p3 "diff" -- 14 pages -- could use a changes-only
verification round)
7, 26, 47, 70, 75, 82, 89, 108, 160, 197, 199, 203, 206, 238

this progression-type is the most troubling. we'd _prefer_ to believe
that -- when a page encounters a "no diff" experience -- it's because
it's clean. but the fact is that these pages were "no diff" in p2, yet
p3 made a change. it might be that we need to have _two_ proofers
verify that a page is clean, but that would mean a significant increase
in the amount of work required. so we need to take a closer look at
what's going on in these cases... (and i do that below...)

***

from here on down, i'd say that none of these pages need more verification...
-> progression-type **-p2-p3 (meaningless changes on a forward-matter page --1 page -- can be ignored) 2 -> progression-type p1-p2-** (no changes after p2 took it to perfection -- 116 pages -- so verified once) 12, 17, 19, 20, 25, 28, 29, 33, 35, 39, 42, 43, 45, 46, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 63, 66, 68, 72, 74, 77, 78, 79, 80, 84, 92, 94, 95, 107, 114, 117, 119, 120, 127, 128, 130, 131, 132, 133, 134, 135, 138, 140, 147, 150, 153, 156, 158, 161, 162, 164, 165, 171, 173, 174, 176, 177, 178, 179, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 198, 200, 201, 202, 204, 207, 208, 210, 211, 213, 214, 215, 216, 217, 219, 220, 221, 222, 223, 224, 226, 228, 229, 230, 231, 232, 233, 234, 235, 236 *** -> progression-type p1-**-** (no changes after p1 took it to perfection -- 76 pages -- so verified _twice _) 5, 6, 10, 14, 16, 21, 24, 30, 31, 32, 34, 37, 40, 41, 44, 61, 62, 64, 67, 69, 71, 73, 76, 83, 85, 86, 87, 88, 91, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 109, 110, 112, 113, 115, 116, 118, 121, 122, 123, 124, 125, 126, 129, 136, 137, 139, 141, 142, 143, 145, 146, 148, 149, 151, 152, 159, 163, 166, 167, 169, 170, 172, 175, 180, 218, 239 no comment necessary here. p1 did an _excellent_ job, transforming some rotten o.c.r. into 76 pages that were _perfect_ according to later proofers... one might argue these pages, "no diff" by p2, could've been _skipped_ by p3, but of course there was the risk they were _actually_ in the p1-**-p3 type... so we'll need to learn what happened with that type before we suggest that. *** -> progression-type **-**-** (no changes at all, meaning o.c.r. got it right, 15 pages, all blank) 1, 3, 4, 9, 11, 13, 15, 27, 93, 157, 240, 241, 242, 243, 244 blank pages were the only pages which tesseract recognized correctly... 
*** so once again, we get the pattern i've discussed all along, the pattern that seems to capture a "common-sense" take, which is that p1 fixes most of the errors, p2 gets most of the remaining ones, and p3 comes in and does clean-up. the only puzzling pages here were those 14 pages where p2 had a "no diff" on the page, but p3 did make a change. so i took a closer look at those pages... *** of these 14 pages in p1-**-p3, 12 were not troublesome: 2=errors that don't have any significance in this analysis; 4=changes that were concerned with end-line hyphenates; 4=errors that could've been detected with pre-processing; 2=correct recognition (might or might not be p-book errors). i've appended the actual text from these changed line-pairs, with my explicit categorization after the line-pairs. this left a mere _2_errors_ that were actual, troubling errors. on file#160: > Paul lingered the bill nervously. Fifty dollars! > Paul fingered the bill nervously. Fifty dollars! and on file#206: > bid good-by to the familiar balls of the school, > bid good-by to the familiar halls of the school, *** this allows us to comment on the suggestion up above that we could've skipped p3 on the p1-**-** pages because the p2 "no diff" acted as a "verification" that the page was clean. this means that -- if we would have skipped p3 on all of the 90 pages where p2 made no change -- that decision would have allowed _2_ errors to pass into this book of 200+ pages. and it's safe to say both would be caught by the general public. so saving the additional round of proofing on those 90 pages would seem -- to me -- to have been a good trade in this case. this is _not_ an argument that a single "no diff" is "good enough" to stop proofing a page... i would think that most people would hold the opinion it takes _2_ "no diff" rounds to be _confident_... but once again, that depends on _how_good_ is "good enough". oh, and by the way... maybe you are a person -- i know some are out there! 
-- who is thinking "2 errors! we can't tolerate 2 errors in a book! 2 many!" get real, buddy... because the normal d.p. workflow?, the one with the p3 "marines"? it left _more_than_ 2 errors in this book, no matter how we count, and i will give you a list of some of their specific errors tomorrow... so if you really want to have books that accurate, you will need to convince the people at d.p. to change over to 4 proofing rounds. or maybe _5_. either way, good luck with _that_... you'll need it... *** one more thing... the parallel p1 proofers _corrected_ both "lingered" and "balls"... that's right, some lowly p1 proofers caught the 2 real errors that both p2 and p3 proofers missed. kinda makes you wonder, eh? so evidently those weren't the mythological "elusive errors"... especially beings the expert proofers missed them... ;+) -bowerbird p.s. here is the listing of the 14 cases, with their analysis, where p2 had a "no diff" on the page, but p3 came and made a change. again, categorization of these cases follows at the very bottom... #7 > Copyright, 1920 > Copyright, 1920, meaningless comma. #26 > "The March Hare!" he repeated wlth enthusiasm. > "The March Hare!" he repeated with enthusiasm. bad word would be caught by spellcheck. #47 > and the Sanscrit Vedas would have been > and the Sanscrit[**typo? Sanskrit] Vedas would have been recognized as it was printed in the p-book; not an o.c.r. error. #70 > I have already explained, care much for reading; > have already explained, care much for reading; the word "i" was doubled from the previous line, so it was _autodetectable_ as a repeated word. #75 > ways at liberty to send contributions back with > at liberty to send contributions back with improper joining of "always" on line above, so wouldn't have happened with a good workflow. #82 > various sources one number after another of ` > various sources one number after another of garbage character should've been eliminated in preprocessing. 
#89 > and Diamonds for the more prosperous ` > and Diamonds for the more prosperous garbage character should've been eliminated in preprocessing. #108 > their own idle pleasure but to financing Gutenburg's > their own idle pleasure but to financing Gutenburg's[**typo? Gutenberg's] recognized as it was printed, consistently, in the p-book; not an o.c.r. error. #160 > Paul lingered the bill nervously. Fifty dollars! > Paul fingered the bill nervously. Fifty dollars! actual error, and a stealth scanno to boot. you can't win 'em all... #197 blank line introduced between paragraphs. outside the scope of this analysis. #199 > "Pretty nearly," returned Mr. Hawley good-naturedly. > "Pretty nearly," returned Mr. Hawley good-*naturedly. stupid asterisk note on a questionable hyphenation. not an error. #203 > the cardboard. The thickness of these semi-cylindrical > the cardboard. The thickness of these semi-*cylindrical stupid asterisk note on a questionable hyphenation. not an error. #206 > bid good-by to the familiar balls of the school, > bid good-by to the familiar halls of the school, actual error, and a stealth scanno to boot. you can't win 'em all... #238 > when weary, sleepy, but triumphant, a half jubilant, > when weary, sleepy, but triumphant, a half-jubilant, improper rejoining of end-line hyphenate... *** here's my categorization of the errors on these 14 pages: errors that don't have any significance in this analysis: #007> meaningless comma on a front-matter page. #197> blank line between paragraphs, not in our scope. changes that were concerned with end-line hyphenates: #075> improper rejoining of end-line hyphenate. #238> improper rejoining of end-line hyphenate. #199> asterisk on de-hyphenation. not o.c.r. error. #203> asterisk on de-hyphenation. not o.c.r. an error. errors that could've been detected with pre-processing: #026> bad word could've been caught by spellcheck. #070> doubled-up word could've been autodetected. 
#082> garbage character could've been autodetected. #089> garbage character could've been autodetected. correct recognition (might or might not be p-book errors): #047> recognized as is in the p-book; not o.c.r. error. #108> recognized as is in the p-book; not o.c.r. error. actual errors that matter: #160> actual error; stealth scanno too; can't win 'em all... #206> actual error; stealth scanno too; can't win 'em all... ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080324/e200579c/attachment.htm From Bowerbird at aol.com Mon Mar 24 23:34:16 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 25 Mar 2008 02:34:16 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 08 Message-ID: ok, we need a quick reminder before this next series of data about the parallel proofing of "paul and the printing press"... in this series, i'll be quantifying the work required to correct the rotten o.c.r., over and above that which _would've_been_ required had a better (i.e., efficient) workflow been observed. it's important to remember that bad scans and bad o.c.r. are _endemic_ over at distributed proofreaders. because of that, it would be shortsighted indeed to blame piggy himself for the awful o.c.r. with which the proofers were faced. of course he was responsible for the bad page-scans which he created, and for the even-more-flawed decision to use _tesseract_, but there are dozens of content providers at d.p. (maybe hundreds) making equally questionable decisions ramifying in poor quality and wasting the time and energy of the well-intended proofers... therefore, the _real_ incompetence is not located at _that_ level, but the level _above_, which allows this bad work to be tolerated. 
somebody should have -- as suggested earlier -- been taking piggy
aside, quietly instructing him that such a poor quality of o.c.r. is
not permitted, informing him how to do the job better, and giving him
a pat on the back and sending him back to work.

since he didn't realize this himself, somebody needed to tell him.

it's just that simple.

***

now, in order to quantify the poor showing of tesseract, i re-did
the o.c.r. on the scans with abbyy, the acknowledged o.c.r. leader.

even though it's _clear_ that the tesseract output is _quite_bad_,
quantifying it will better illustrate how much energy it's wasting...

doing the o.c.r. was easy. of somewhat more -- manual -- work
was synching the two sets of o.c.r. so that we can _compare_ 'em.

so here we have the original o.c.r., from tesseract:
> http://z-m-l.com/go/paulp/paul-tesseract.html

and here we have the new o.c.r., from finereader:
> http://z-m-l.com/go/paulp/paul-abbyy.html

if you load those two pages into two windows using your browser,
you can compare them straight across, now that i've synched them.

this comparison reveals there is no comparison between them...

but, if you want numbers, they _are_ more alike than different,
with roughly 4000 lines in common, and 3000 lines that differ.

but still, even a quick glance reveals the abbyy output is better.

later today or tomorrow, i will give some statistics to back it up,
but it is clearly observable even in an "eyeball" test like this that
abbyy gets a _lot_ more right than tesseract, by a _large_ margin.

equally clear is that _some_ of the pages need to be re-scanned,
because even abbyy was unable to deal with their "gutter noise"...

now last week, there was some discussion of "unpaper", which
might have led some people to believe that the problems with
the "gutter noise" were _unavoidable_, which is _unfortunate_,
because the pattern of affected pages indicates that this was
a _human_ problem. very simply put, insufficient care was taken.
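the like/unlike line counts above are simple to compute once the two
transcriptions are synched line-for-line. an illustrative python sketch
(the two sample strings are stand-ins, not the actual texts linked above):

```python
# Count how many line-synched lines two OCR transcriptions agree on.
# The two sample strings below are stand-ins for illustration; the
# real tesseract/abbyy outputs live at the URLs given above.

def line_agreement(text_a, text_b):
    """Return (matching, differing) line counts for line-synched texts."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    # compare position-by-position, since the texts are already synched
    same = sum(1 for a, b in zip(lines_a, lines_b) if a == b)
    total = min(len(lines_a), len(lines_b))
    return same, total - same

ocr_one = 'Paul lingered the bill nervously.\nFifty dollars!\n'
ocr_two = 'Paul fingered the bill nervously.\nFifty dollars!\n'
same, differ = line_agreement(ocr_one, ocr_two)
```

on the two-line sample this yields one matching line and one differing
line; run over the full synched outputs it would give the roughly
4000-in-common / 3000-differing split reported above.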
those scans didn't need to be "cleaned up". they needed to be _redone_. period. when those pages are re-scanned, they will deliver quality o.c.r. (at least if they're treated by abbyy, which is what i recommend.) again, this is the kind of thing for which you take a person aside, quietly inform them that this level of quality is unacceptable, and then give them a pat on the back and send them back to work... and since nobody did that in this case, i have done it here now. *** scans that can be better if you take sufficient care are unacceptable. and o.c.r. performed with a beta o.c.r. program is fully intolerable... and every single content provider at d.p. should know these things. *** once again, this is _not_ a reflection on, or a criticism of, piggy. i'm sure he's very nice, a good father to his children, and so on. and the fact that he offered up these projects of his for analysis means that he's willing to learn, which is a very admirable trait... so even though he's a step above the average volunteer at d.p. -- in the sense that he's taking on these additional missions -- and thus is probably more likely to be one who _takes_people_ aside, rather than being taken aside himself, the clear lesson is that some of the "step-above" volunteers need to up their game. the people at the bottom of the pile seem to be doing excellent. they are taking shit and turning it into shinola -- no small task... if only the people at the _front_ end of the workflow would stop "injecting" so many "errors" into the text before proofers get it... *** and one more thing, while i'm at it... i spent literally _years_ here complaining about d.p. inefficiency. for a very long time, i resisted getting overly specific, because i wanted to give the d.p. people the opportunity to "save face". they squandered this opportunity, using the flexibility i gave them to lash out at me personally, rather than to clean up their own act. 
this indicates how morally bankrupt their position remains today... believe me, i could have offered up book after book after book as examples of the poor workflow. at any time. and i still can do that. as a social scientist, i know the power of data, and i know it well... i held it in reserve because i know its power, but the _inaction_ of the d.p. "leadership" to correct its flaws, coupled with their clumsy attempts to silence my legitimate charges, now gives me no choice. i'm using _these_ books only because d.p. picked them out itself... and now that i've begun, i'm going to _finish_ the job, completely... i know many of you are tired of these books, and fatigued by data, but i'm gonna continue posting until i've completed the analyses, because down the line i will be referring back to this _solid_data_ whenever i repeat claims about awful d.p. workflow inefficiency... and from now on, i won't be giving d.p. any wiggle room at all... they need to fix their workflow, and they need to start that now. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080325/7b16b324/attachment-0001.htm From schultzk at uni-trier.de Tue Mar 25 02:07:33 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Tue, 25 Mar 2008 10:07:33 +0100 Subject: [gutvol-d] Moderation/censorship In-Reply-To: <200803211952.00866.donovan@abs.net> References: <200803211952.00866.donovan@abs.net> Message-ID: <025F131E-C271-4B6D-BBA4-E4858F3829AB@uni-trier.de> Hi David, I have big problems with your accusations. You mentioned that Bowerbird evidently sabotaged the experiment. What I would like to know is in what way? Or is it that due to his work in connection with the experiment did give you the expected results?
If so there are rules for eliminating such anomalies. I do know what I am talking about. I have the feeling that the experiment did go as you expected and have found that do to BB work the results ended up the way they are. If so either: 1) your hypothesis is wrong 2) you can safely remove BB work as an outlier I would love to scrutinize your academic experiment, but I am sure you would not like the result. Anyway, regards Keith J. Schultz Am 22.03.2008 um 00:52 schrieb D Garcia: > On Friday 21 March 2008 05:13, Bowerbird at aol.com wrote: >> [snip, snip] > > Since bowerbird mentions it, let's review the sum total of his known > proofreading activities at DP. It's quite an enlightening view, and > very > relevant to the discussion. > > As bowerbird, 32 pages back in the years when DP had only two rounds. > > As bradjohnson, 3 pages, account not used in 251 days. > > As haroldjohnson, 4 pages, most recently a single page on March 7, > 2008. > > As ellipsisshellipis, (interesting nick choice), 16 pages on March > 7, 2008 > (the date the account was created), and the 116 pages of "work" in the > experiment project on March 19, 2008. This account was also used to > post a > poll on the DP forums. (See above where bb clearly states his > belief was that > he was explicity banned from posting in the forums.) > > As sandy claws, no pages, but a Christmas Day 2007 posting (the day > the > account was created.) (Again, see above where bb clearly states his > belief > was that he was explicity banned from posting in the forums.) > > Patterns, anyone? > > Out of all the projects available to choose from during all that time, > bowerbird only managed to find *one* that piqued his interest, and > it just so > happened to be the one he's been ever so faithfully posting about > here, in > much less than flattering terms. > > Obviously he understood that he was banned from posting in the DP > forums, and > yet he used two freshly-minted accounts to do exactly that. 
> >>> he recently used one to intentionally >>> sabotage >>> the experiment in continuous proofing >> >> untrue. and a low blow to boot. > > See above. > >> i didn't "sabotage" the experiment. > > The people actually running the experiment at DP say differently, > used far > stronger language in describing his efforts in that project, and > are to me > far more credible as references. > >> i was doing the one thing i was still _allowed_ >> to do at distributed proofreaders, i.e., proof... > > See above for evidence regarding bowerbird's obvious commitment to DP. > >> and i did a darn good job on every page i did. > > Many of our volunteers with bowerbird's level of experience with DP > also > believe the above statement to be true of themselves. > >> no sir, as far as i know, and i would _love_it_ >> if someone pointed out a mistake i had made, >> because i _learn_ from my _mistakes_, i _do_, >> but as far as i know, i made _no_ mistakes on >> the 128+ pages which i proofed... not a one... > > Perhaps bowerbird has chosen to learn from the wrong mistakes. > Let's skip on a bit... > >> there was no :trolling". and there's been no "sabotage". >> >> From julio.reis at tintazul.com.pt Tue Mar 25 06:22:46 2008 From: julio.reis at tintazul.com.pt (=?ISO-8859-1?Q?J=FAlio?= Reis) Date: Tue, 25 Mar 2008 13:22:46 +0000 Subject: [gutvol-d] Unexpected Events In-Reply-To: References: Message-ID: <1206451366.12118.128.camel@abetarda.mshome.net> Unexpected Events in 5 to 10 years: the end of the world. On Thursday 23 April 2015, UNESCO World Book and Copyright Day, Canada, the European Union and about 70 other countries repudiated the Berne Convention and changed copyright law to publication+50. 
The so-called Laval Convention (after the city in Québec where the treaty was signed) was met with much controversy, being supported by cultural organisations both grass-roots and otherwise (like UNESCO), and meeting strong opposition particularly from the White House, the Kremlin and the Australian government, and a huge coalition of media industry giants. The following transition applied in the European Union:

On 1 Jan...   PD includes books:
--------------------------------
2016          author died 1945; end of all national "special cases"
2017          published 1946
2018          published 1949
2019          published 1952
2020          published 1955
2021          published 1958
2022          published 1961
2023          published 1964
2024          published 1967
2025          published 1970
2026          published 1973
2027          published 1976

On 1 Jan 2016, all books published until 1965 became public domain in Canada, and in many nations which formerly followed a life+50 copyright law, like Angola, Chile, all North African countries and New Zealand. The end of Crown Copyright was considered one of the most surprising events in Europe. In fact, King Arthur II is rumoured to have been preparing the Laval Convention since his accession to the throne in November 2008. His was also the proposal that a city in Québec be chosen to sign the Convention, a declaration which stirred some conservative sectors of US politics. The United States Ambassador in London, Jon Huntsman, Jr., called it "a cultural provocation right at our doorstep." This declaration, and the following remark that such a treaty was better suited to being signed "in some forsaken Bulgarian village", cost him his seat. Australia also surprised the world two years later by demarcating itself from US copyright policies, which it had been following in the previous years, and on 23 April 2018 aligned itself with Canada/EU. It used the same transition as the EU for the 2019-2027 period.
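The (fictional) EU transition schedule above follows a simple rule: each 1 January from 2017 through 2027 admits three more publication years to the public domain, starting with books published up to 1946. A sketch checking that arithmetic:

```python
# sketch: the EU transition schedule from the (fictional) table above.
# each 1 Jan from 2017 to 2027 admits three more publication years,
# starting with books published up to 1946.
def pd_cutoff(year):
    if not 2017 <= year <= 2027:
        raise ValueError("outside the 2017-2027 transition")
    return 1946 + 3 * (year - 2017)

print(pd_cutoff(2018), pd_cutoff(2027))  # 1949 1976
```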
(The US would change its copyright laws, but not until 2026, so that unexpected event is out of our 5-10 year horizon.) Júlio. From piggy at netronome.com Tue Mar 25 08:08:32 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Tue, 25 Mar 2008 11:08:32 -0400 Subject: [gutvol-d] Moderation/censorship In-Reply-To: <025F131E-C271-4B6D-BBA4-E4858F3829AB@uni-trier.de> References: <200803211952.00866.donovan@abs.net> <025F131E-C271-4B6D-BBA4-E4858F3829AB@uni-trier.de> Message-ID: <47E91570.5040107@netronome.com> Schultz Keith J. wrote: > Hi David, > > I have big problems with your accusations. > You mentioned that Bowerbird evidently sabotaged > the experiment. What I would like to know is > in what way? Or is it that due to his work in > connection with the experiment did give you the > expected results? > > If so there are rules for eliminating such > anomalies. I do know what I am talking about. > > I have the feeling that the experiment did go as > you expected and have found that do to BB work > the results ended up the way they are. > > If so either: > 1) your hypothesis is wrong > 2) you can safely remove BB work as an outlier > > I would love to scrutinize your academic experiment, but > I am sure you would not like the result. > Most of the raw data is available directly to the public: http://www.pgdp.net/wiki/Confidence_in_Page_analysis#Perpetual_P1 Proofer identities are the only protected data. I suggest using check-ins of regular period to make short-range determinations of "same proofer". I think I have adequately demonstrated my willingness to accept formal analysis from all interested volunteers. Yes, the anomalies under discussion hardly spell the end of the experiment. I will even go so far as to say that "sabotage" is too strong a term. What irritated me most is that overenthusiastic participation forced me to do data analysis I hoped to postpone. I was obliged to statistically check the claim that the pages in question were edited offline.
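The wa/w figures that follow are a simple ratio (words altered by later rounds over words seen), and the "1/5th the rate of all other proofers combined" comparison is just a quotient of two such ratios. A sketch, with all counts invented for illustration:

```python
# sketch of the wa/w (words-altered per word-seen) defect metric.
# all counts below are invented for illustration only.
def wa_w(words_altered, words_seen):
    return words_altered / words_seen

one_proofer = wa_w(4, 6000)      # the proofer in question
all_others  = wa_w(100, 30000)   # everyone else combined
print(round(one_proofer / all_others, 3))  # 0.2 -> "1/5th the rate"
```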
I congratulate the proofer in question for their steadily improving skill. In I3 they found defects (wa/w) at 1/5th the rate of all other proofers combined. In I4 they found defects at 1/3rd the rate of all other proofers combined. I see no value in participating in the expected thread on the meaning and validity of these statistics. Yes, there are problems with the wa/w metric. We're working to address them. > Anyway, regards > Keith J. Schultz > > > Am 22.03.2008 um 00:52 schrieb D Garcia: > >> On Friday 21 March 2008 05:13, Bowerbird at aol.com wrote: >> > [snip, snip] > > >> Since bowerbird mentions it, let's review the sum total of his known >> proofreading activities at DP. It's quite an enlightening view, and >> very >> relevant to the discussion. >> >> As bowerbird, 32 pages back in the years when DP had only two rounds. >> >> As bradjohnson, 3 pages, account not used in 251 days. >> >> As haroldjohnson, 4 pages, most recently a single page on March 7, >> 2008. >> >> As ellipsisshellipis, (interesting nick choice), 16 pages on March >> 7, 2008 >> (the date the account was created), and the 116 pages of "work" in the >> experiment project on March 19, 2008. This account was also used to >> post a >> poll on the DP forums. (See above where bb clearly states his >> belief was that >> he was explicity banned from posting in the forums.) >> >> As sandy claws, no pages, but a Christmas Day 2007 posting (the day >> the >> account was created.) (Again, see above where bb clearly states his >> belief >> was that he was explicity banned from posting in the forums.) >> >> Patterns, anyone? >> >> Out of all the projects available to choose from during all that time, >> bowerbird only managed to find *one* that piqued his interest, and >> it just so >> happened to be the one he's been ever so faithfully posting about >> here, in >> much less than flattering terms. 
>> >> Obviously he understood that he was banned from posting in the DP >> forums, and >> yet he used two freshly-minted accounts to do exactly that. >> >> >>>> he recently used one to intentionally >>>> sabotage >>>> the experiment in continuous proofing >>>> >>> untrue. and a low blow to boot. >>> >> See above. >> >> >>> i didn't "sabotage" the experiment. >>> >> The people actually running the experiment at DP say differently, >> used far >> stronger language in describing his efforts in that project, and >> are to me >> far more credible as references. >> >> >>> i was doing the one thing i was still _allowed_ >>> to do at distributed proofreaders, i.e., proof... >>> >> See above for evidence regarding bowerbird's obvious commitment to DP. >> >> >>> and i did a darn good job on every page i did. >>> >> Many of our volunteers with bowerbird's level of experience with DP >> also >> believe the above statement to be true of themselves. >> >> >>> no sir, as far as i know, and i would _love_it_ >>> if someone pointed out a mistake i had made, >>> because i _learn_ from my _mistakes_, i _do_, >>> but as far as i know, i made _no_ mistakes on >>> the 128+ pages which i proofed... not a one... >>> >> Perhaps bowerbird has chosen to learn from the wrong mistakes. >> Let's skip on a bit... >> >> >>> there was no :trolling". and there's been no "sabotage". >>> >>> >>> > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > From Bowerbird at aol.com Tue Mar 25 10:17:51 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 25 Mar 2008 13:17:51 EDT Subject: [gutvol-d] moderation/censorship (let us celebrate because p.g. has renounced them!) Message-ID: piggy said: > the anomalies under discussion hardly spell the end of the experiment. yes, they "hardly" do. 
in fact, since i essentially "cycled through" pages on which there were no errors present, i _sped_up_ the experiment, moving the text along (as best as i could) to the state of "perfect", so that "error injection" -- if it's going to happen -- can _begin_... i intentionally avoided any page that had an outstanding error on it -- even though i'd identified all such pages and could've fixed them -- as i was as curious as anyone else to see if the next proofer caught it, cheering on the proofers with sharp eyes, and booing ones who missed the error _again_ (crap, now we will have to do a whole 'nother round)... and, on another level, i was showing you the _answer_ to your question. you were asking what would happen if you recycled a text "perpetually". the answer is that _someone_ -- in this case it was me -- will eventually decide to analyze _the_entire_text_ and identify the outstanding errors, and -- if they are allowed to do so -- go in and fix them all in one shot. even as it was, one of the proofers called you on the ellipse problem, noting that most of the changes being made had devolved to ellipses, which is why you had to say "ok, from now on, you can ignore ellipses". proofers will not put up with the page-by-page straitjacket for long... especially when a single button-click gets the text of the whole book. > I will even go so far as to say that "sabotage" is too strong a term. it's not just "too strong". it's headed in the completely wrong direction. > What irritated me most is that overenthusiastic participation that's quite a euphemism for a dedicated proofer... > forced me to do data analysis I hoped to postpone. evidently you're not reading my posts, because my messages have shown that i know _exactly_and_precisely_ what's been happening with that text. i know who fixed what, when, and where the remaining errors are... my analyses allow me to see things clearly that yours will _never_ reveal. 
> I was obliged to statistically check the claim that > the pages in question were edited offline. only because you aren't keeping up... otherwise, you would have known my work was excellent. besides, doing a comparison of iteration#4 with iteration#5 is a 5-minute operation. if you didn't want to do it, you could've asked me, and i would have given you a complete report on it... except for the "[**intentional]" tags on that one page, all of my changes revolved around the elimination of spacey ellipses... and i was the one who _found_ the third "error" in that paragraph that allowed me to make the judgment that they were _intentional_, so i _deserved_ to make that change. > I see no value in participating in the expected thread > on the meaning and validity of these statistics. the thread that says there is no meaning or value to your statistics? i don't expect it will be a very long thread, as i've just summed it up. -bowerbird p.s. and keith, thanks for defending me. but i can do it myself. with one hand tied behind my back. these guys have no punch. if you're expressing yourself, then fine, by all means, continue... but if you're doing it to "help" me, save your energy for later on. ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080325/fbf5be9b/attachment.htm From Bowerbird at aol.com Tue Mar 25 10:38:47 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 25 Mar 2008 13:38:47 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 08 Message-ID: here's a web-page with about 1771 clear differences between the output from tesseract and the output from abby finereader on the d.p. parallel experiment on "paul and the printing press". 
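A difference listing like the one described here — pairs of synched lines where the two engines disagree — is the same "5-minute operation" as comparing two proofing iterations: zip the files and keep the mismatches. A sketch, with invented sample lines:

```python
# sketch: build tesseract/abbyy difference pairs from two synched
# line lists, keeping only lines that disagree (sample data invented).
def difference_pairs(tess_lines, abbyy_lines):
    return [(t, a) for t, a in zip(tess_lines, abbyy_lines) if t != a]

tess  = ['"Say, Dad," he hegan', "the press-room door"]
abbyy = ['"Say, Dad," he began', "the press-room door"]
print(difference_pairs(tess, abbyy))
```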
the top line in each pair is from tesseract, and the bottom line is from abbyy finereader. (there are more differences than this, but these are the ones that are _very_sharp...) here you can see again, in detail, the superiority of finereader over tesseract. why waste proofer eyeballs with inferior o.c.r.? also notice that some pages have a large number of cases here, where characters on the left edge of the scan were contaminated by "noise" from pages which were scanned using insufficient care. again, why waste proofer time and energy on poorly-scanned pages? *** also of note: i improved the synch on the two sets of o.c.r. so if you downloaded these files before, please do it again. here we have the original o.c.r., from tesseract: > http://z-m-l.com/go/paulp/paul-tesseract.html and here we have the new o.c.r., from finereader: > http://z-m-l.com/go/paulp/paul-abbyy.html -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080325/f235783a/attachment-0001.htm From Bowerbird at aol.com Tue Mar 25 12:03:44 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 25 Mar 2008 15:03:44 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 08 Message-ID: oops! forgot to include the u.r.l. for that comparison-page... here's a web-page with about 1771 clear differences between the output from tesseract and the output from abby finereader on the d.p. parallel experiment on "paul and the printing press". > http://z-m-l.com/go/paulp/1771tess_v_abbyy.html the top line in each pair is from tesseract, and the bottom line is from abbyy finereader. (there are more differences than this, but these are the ones that are _very_sharp...) 
here you can see again, in detail, the superiority of finereader over tesseract. why waste proofer eyeballs with inferior o.c.r.? also notice that some pages have a large number of cases here, where characters on the left edge of the scan were contaminated by "noise" from pages which were scanned using insufficient care. again, why waste proofer time and energy on poorly-scanned pages? *** also of note: i improved the synch on the two sets of o.c.r. so if you downloaded these files before, please do it again. here we have the original o.c.r., from tesseract: > http://z-m-l.com/go/paulp/paul-tesseract.html and here we have the new o.c.r., from finereader: > http://z-m-l.com/go/paulp/paul-abbyy.html -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080325/97088e9b/attachment.htm From jeroen.mailinglist at bohol.ph Tue Mar 25 15:44:30 2008 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Tue, 25 Mar 2008 23:44:30 +0100 Subject: [gutvol-d] Moderation/censorship In-Reply-To: References: Message-ID: <47E9804E.9000408@bohol.ph> Bowerbird at aol.com wrote: > there was no :trolling". and there's been no "sabotage". > However, he was not able to resist changing the word "troll" into "vvaannddaall ttrroollll" on the page where it occurred (page 045.png of projectID47c4c0eeec634). Let us accept that as the final admission by bowerbird of being a troll. Trolls can do considerable damage to a mailing list and discussion forum. They are a nuisance similar to spam, and should be dealt with in a similar way. That is no more censorship than killing countless mentions of Viagra in your inbox, or removing a guy continuously shouting "fire" from the theater. 
Much serious and fruitful discussion is rendered impossible due to the high noise level introduced by a single person who takes a special pleasure in provoking people, and in frustrating things, as that is apparently the best that person can do. I subscribe to this list, like to read most people's valuable opinions, and add mine once in a while -- and will continue to do so, although I normally ignore our house troll. I invite everybody here to ignore our feathered friend from now on, and, if the nuisance gets too much, move over to the pgdp.net forums, where similar discussions are going on, without this curious part of PG culture. We could translate the issue of trolling to "Real Life" situations, which are somewhat indicative of the difficult issues at stake.... In Holland, we currently have a politician (Geert Wilders) running amok, displaying a near endless passion in his attempts to provoke Muslims, and once they are provoked, claiming, see, I told you so! He shows all the common Internet troll traits. He talks about a movie he has made, and makes all kinds of excuses about not showing it to anybody yet. A couple of years ago, a Dutch documentary maker was killed for making a movie that some Muslims considered insulting. Although I strongly believe people should be free to say what they think about Islam (or any other subject), he is purposely pushing things to the limit. His claims are grossly insulting, racist and irrational. However, almost all Muslims in Holland have remained silent, and instead Jewish organizations started to speak out. Somebody actually took 25 of his public statements, just replaced the word Muslim with Jew, printed them on a pamphlet, distributed it publicly, and got himself arrested for spreading hatred against Jews.
Such spreading of hatred against an ethnic or religious group is, unlike in the US, against the law here, although no action was taken against this politician in over a year of repeated and ever-increasing insults. This is of course a set-up action, planned ahead to go through all levels of courts, and one which will give judges a very hard time applying the law... If you convict for spreading hatred against Jews, but not Muslims, you discriminate in the application of justice; if you do not convict, you ignore the law. Jeroen. From Bowerbird at aol.com Tue Mar 25 16:25:04 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 25 Mar 2008 19:25:04 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 09 Message-ID: it occurs to me that i have been telling you about the pagescans associated with the parallel test of "paul and the printing press", but i've never actually told you directly where you can view them. of course, the images are always available from the project page: > http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea just find the link that says "view images online" and click that. note that this works all the time, for any book in the system... this one-image display works acceptably well for many purposes. *** in addition, however, for this book, i've put the scans on my website, so you can view them there using my system. for instance, this link will take you to the pagespread for page 32. > http://z-m-l.com/go/paulp/paulpp032w.html in that pagespread view, you can click the right page to go ahead, or click the left page to go backward in the book. or you can also use the links spread across the top of the two-page pagespread... (the "-chap-" and "+chap+" buttons can be very useful at times, for some purposes, because they skip from chapter to chapter...) i prefer this pagespread view, as it's more practical in some situations, _and_ it's twice as fast...
:+) *** at any rate, as you step through the book's pagespreads, you'll see that the "gutter noise" problem was _intermittent_, which indicates that it was caused by insufficient care being taken on some pages... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080325/40a2f08c/attachment.htm From Bowerbird at aol.com Tue Mar 25 16:53:02 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 25 Mar 2008 19:53:02 EDT Subject: [gutvol-d] moderation/censorship (let us celebrate because p.g. has renounced them!) Message-ID: jeroen said: > Let us accept that as the final admission by bowerbird of being a troll. it was no such "admission" at all. (but interesting attempt at spin, jeroen.) no, it was a taunt, throwing back the word you people have thrown at me... the initial change was "troll" to "ttrroollll". and then, when piggy called that "vandalism" at one point -- at the time, i don't believe he knew who had done it -- i went in on the next round and added "vvaannddaall" to it. i knew these words would pop up in a spellcheck in post-processing -- if they even managed to get _that_ far -- so there'd be no damage. but yes, with all those caveats, i was sending a message to you, letting you know that it was _bowerbird_ who proofed that page. (just like when i posted that poll over on the d.p. forums, i included a gag response that mentioned "pudding". aha!) and piggy thought "vvaannddllee" was amusing enough that he actually put it on his blog as a word that he had made up. so at least _someone_ has an appropriate sense of humor there... > Trolls can do considerable damage > to a mailing list and discussion forum. 
so can small-minded people who can't deal with the logic, so they tar the other person with false charges, like "troll"... in the old days, we used to call that "ad hominem"... for years, you guys argued with me incessantly, and then had the _gall_ to _blame_me_ because you said i "wanted" a fight; as you put it "who takes a special pleasure in provoking people". you couldn't even take responsibility for your own behavior. and you still can't. so while i'm putting up post after post with hard solid _data_, you counter with this weak-ass whine. i don't provoke _people_, i provoke _thought_... none of you seem to have absolutely _anything_ to contribute in terms of intellectual discussion. it's somewhat _amazing_... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080325/0b604c55/attachment.htm From hyphen at hyphenologist.co.uk Wed Mar 26 01:14:23 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Wed, 26 Mar 2008 08:14:23 -0000 Subject: [gutvol-d] moderation/censorship (let us celebrate because p.g. has renounced them!) In-Reply-To: References: Message-ID: <001501c88f19$685e1610$391a4230$@co.uk> Bowerbird at aol.com wrote jeroen said: >> Let us accept that as the final admission by bowerbird of being a troll. > it was no such "admission" at all. (but interesting attempt at spin, jeroen.) In my opinion bowerbird is not a troll. The vast majority of his posts are On Topic. Expressing opinions with which others do not agree is not Trolling, it is encouraging informed debate Dave Fawthrop -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080326/8a13f376/attachment.htm From schultzk at uni-trier.de Wed Mar 26 02:32:57 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Wed, 26 Mar 2008 10:32:57 +0100 Subject: [gutvol-d] moderation/censorship (let us celebrate because p.g. has renounced them!) In-Reply-To: References: Message-ID: <397BD29A-CE11-42FE-83E8-7A5F55625AFE@uni-trier.de> Hi Bowerbird, More or less expressing myself. In the works you are "defended"(?). I would say La Monte refuted the claims I had commented on and proved my points. regards Keith Am 25.03.2008 um 18:17 schrieb Bowerbird at aol.com: > [snip, snip] > -bowerbird > > p.s. and keith, thanks for defending me. but i can do it myself. > with one hand tied behind my back. these guys have no punch. > if you're expressing yourself, then fine, by all means, continue... > but if you're doing it to "help" me, save your energy for later on. > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080326/41d68cf4/attachment-0001.htm From Bowerbird at aol.com Wed Mar 26 03:29:17 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 26 Mar 2008 06:29:17 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 09 Message-ID: oh-ho, you're gonna like this one... it's a merge of the tesseract and abby output, with the 4000-5000 identical lines in _black_. it doesn't necessarily mean that they're _correct_; usually, but _can_ mean they have identical errors. the lines which show a difference are both listed, with the top line tesseract and the bottom abby... moreover, if the lines differ in some "quasi" way -- basically whitespace or em-dashes right now -- the tesseract line is magenta, the bottom abby blue. and when lines differ in some more substantial way, the top tesseract line is red, the bottom abby blue... color makes them stand out for easier examination. 
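The magenta/red ("quasi" vs substantial) split described above amounts to normalizing away whitespace and em-dash differences before comparing. A rough sketch:

```python
import re

# sketch of the "quasi" vs "substantial" difference split described
# above: differences that disappear when whitespace and em-dash
# spelling are normalized are quasi; anything else is substantial.
def normalize(s):
    s = s.replace("--", "\u2014")          # ascii double-hyphen -> em-dash
    return re.sub(r"\s+", " ", s).strip()  # collapse runs of whitespace

def classify(tess_line, abbyy_line):
    if tess_line == abbyy_line:
        return "identical"
    if normalize(tess_line) == normalize(abbyy_line):
        return "quasi"
    return "substantial"

print(classify("stop--now", "stop\u2014now"))                # quasi
print(classify("the pr1nting press", "the printing press"))  # substantial
```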
you can learn a lot -- especially on auto-correction of o.c.r. errors -- by studying these difference-pairs. take my word for it... > http://z-m-l.com/go/paulp/abbyytessmerge01.html -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080326/36353cc6/attachment.htm From Bowerbird at aol.com Wed Mar 26 15:04:56 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 26 Mar 2008 18:04:56 EDT Subject: [gutvol-d] on the rejoining of end-line-hyphenates Message-ID: third-party dialog is a real pain in the ass... speaking of the hind quarters, that appears to be where big_bill has his head, again, and he's pontificating, again, this time on the "need" to rejoin end-of-line hyphenates... the second-grader lecturing to us as if we are first-graders... let's pretend for a minute bill's correct (even though he's not), and that a "best practices" digitization workflow would indeed rejoin end-of-line hyphenates. (it wouldn't; we're pretending.) even in this situation, it's _stupid_ to have _proofers_ rejoining. no, instead, just have the _computer_ do it. first, human energy is precious, and thus should be conserved. second, humans err. we make mistakes. some say it's the _essence_ of being human. and mistakes on the rejoining then have to get fixed themselves. so have the computer do it. in other words, have your pre-processing tool rejoin hyphenates. it can do a better job anyway, since it can access your dictionary... it also should "clothe" the em-dashes, if you're going to do that. no, that's not a "best practice" in digitization either, knucklehead. but you do it nonetheless. (that "clothe" word has to be one of the _stupidest_ words in the whole jargon of distributed proofreaders.) 
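[editor's note: the dictionary-assisted rejoining described here might look roughly like this. It is an illustrative sketch, not anyone's actual tool; the tiny `WORDLIST` set stands in for a real dictionary, and `rejoin_hyphenates` is a hypothetical helper. When the joined form is not attested, the hyphen is left for a human, which is the conservative behavior the thread keeps circling around.]

```python
# stand-in for a full dictionary; a real tool would load a wordlist file
WORDLIST = {"morning", "background", "together"}

def rejoin_hyphenates(lines, wordlist=WORDLIST):
    """Rejoin end-of-line hyphenates when the dictionary attests the
    joined form; otherwise leave the hyphen in place for a human."""
    out = list(lines)
    for i in range(len(out) - 1):
        if out[i].endswith("-"):
            head = out[i][:-1].rsplit(" ", 1)[-1]   # fragment before the hyphen
            tail_parts = out[i + 1].split(" ", 1)
            tail = tail_parts[0]                     # fragment after the break
            if (head + tail).lower() in wordlist:
                # dictionary says the joined form is a word: pull the
                # tail fragment up and drop the line-break hyphen
                out[i] = out[i][:-1] + tail
                out[i + 1] = tail_parts[1] if len(tail_parts) > 1 else ""
    return out

print(rejoin_hyphenates(["the sun rose that morn-", "ing over the hills"]))
```

[a pair like "well-" / "known" would be left untouched here, since "wellknown" is not in the wordlist; that is deliberate.]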
and have your tool close up spacey ellipses while you're at it. then your humans won't have to do all these _mundane_ tasks. because humans should _never_ have to do those mundane tasks. that's _precisely_ why i refuse to charge humans with an "error" when they "fail" to accomplish your busy-work "requirements"... so -- if you must follow this pretend "best-practice" -- at _least_ have the decency to have the computer do all the routine work... after all, that's what it's good for. understand? or did you even hear, with your head up your butt? -bowerbird p.s. the _real_ "best practices" on end-line hyphenates, exactly like everything, is to give users the option to _choose_ what they want... ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080326/99e0aef7/attachment.htm From paulmaas at airpost.net Wed Mar 26 16:29:13 2008 From: paulmaas at airpost.net (Paul Maas) Date: Wed, 26 Mar 2008 16:29:13 -0700 Subject: [gutvol-d] on the rejoining of end-line-hyphenates In-Reply-To: References: Message-ID: <1206574153.16077.1244489543@webmail.messagingengine.com> The "head up the ****" is uncalled for. On Wed, 26 Mar 2008 18:04:56 EDT, Bowerbird at aol.com said: > third-party dialog is a real pain in the ass... > > speaking of the hind quarters, that appears to be where > big_bill has his head, again, and he's pontificating, again, > this time on the "need" to rejoin end-of-line hyphenates... > > the second-grader lecturing to us as if we are first-graders... > > let's pretend for a minute bill's correct (even though he's not), > and that a "best practices" digitization workflow would indeed > rejoin end-of-line hyphenates. (it wouldn't; we're pretending.) 
> > even in this situation, it's _stupid_ to have _proofers_ rejoining. > > no, instead, just have the _computer_ do it. first, human energy > is precious, and thus should be conserved. second, humans err. > we make mistakes. some say it's the _essence_ of being human. > and mistakes on the rejoining then have to get fixed themselves. > > so have the computer do it. > > in other words, have your pre-processing tool rejoin hyphenates. > it can do a better job anyway, since it can access your dictionary... > > it also should "clothe" the em-dashes, if you're going to do that. > no, that's not a "best practice" in digitization either, knucklehead. > but you do it nonetheless. (that "clothe" word has to be one of the > _stupidest_ words in the whole jargon of distributed proofreaders.) > > and have your tool close up spacey ellipses while you're at it. > > then your humans won't have to do all these _mundane_ tasks. > > because humans should _never_ have to do those mundane tasks. > > that's _precisely_ why i refuse to charge humans with an "error" > when they "fail" to accomplish your busy-work "requirements"... > > so -- if you must follow this pretend "best-practice" -- at _least_ > have the decency to have the computer do all the routine work... > after all, that's what it's good for. > > understand? or did you even hear, with your head up your butt? > > -bowerbird > > p.s. the _real_ "best practices" on end-line hyphenates, exactly like > everything, is to give users the option to _choose_ what they want... > > > > ************** > Create a Home Theater Like the Pros. Watch the video on AOL > Home. 
> > (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001)

-- Paul Maas paulmaas at airpost.net

-- http://www.fastmail.fm - Accessible with your email software or over the web

From Bowerbird at aol.com Wed Mar 26 17:35:03 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 26 Mar 2008 20:35:03 EDT
Subject: [gutvol-d] on the rejoining of end-line-hyphenates
Message-ID:

paul said: > The "head up the ****" is uncalled for.

i agree! help him pull it out!

-bowerbird

************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080326/853220bb/attachment.htm

From hyphen at hyphenologist.co.uk Wed Mar 26 23:26:53 2008
From: hyphen at hyphenologist.co.uk (Dave Fawthrop)
Date: Thu, 27 Mar 2008 06:26:53 -0000
Subject: [gutvol-d] on the rejoining of end-line-hyphenates
In-Reply-To: References: Message-ID: <000001c88fd3$8d686680$a8393380$@co.uk>

Bowerbird at aol.com wrote
> speaking of the hind quarters, that appears to be where
> big_bill has his head, again, and he's pontificating, again,
> this time on the "need" to rejoin end-of-line hyphenates...

Just a word on end of line hyphenates, and words with link hyphens.

If a word with a link hyphen appears at the end of a line, the line is often broken after the link hyphen. Because link hyphens disappear in later and/or modern usage, it is often impossible to tell from an old text if a hyphen at the end of a line is a link hyphen or a hyphenated word, and so whether it should be rejoined or not. If in Shakespeare one finds "bed- room", should an etext contain bed-room or bedroom? See below.

This is in practice of no interest to the general reader; it is of interest only to the terminal pedant and academics.
Thus when processing much less important texts, I just do what seems right to me at the time.

Ronald C McIntosh wrote a book on the subject, Computer Hyphenation, which should be on the web but isn't. Here is the relevant bit.

>>> Chapter 10: The hyphened word

Many words entered the language by first being joined together by a link-hyphen, which created a new compound word. With the passing of time most of these words became fully assimilated, eventually (but not always) dropping the hyphen. The following words occurred only once in Shakespeare's works, printed with link-hyphens, and all passed smoothly into the language: tear-ful blood-thirsty bed-room gentle-folks dis-agree out-break tear-stained earth-bound. Others failed to make the grade, such as: temple-haunting (Macbeth) and cloud-kissing (Lucrece). The successful words may have been happy inventions of the moment, fruits of bardic genius, but we might suspect that sometimes they were already established, or had been overheard by the playwright in his favourite hostelry.

When Samuel Pepys was writing his diary, "every body" was two words, still awaiting either a link-hyphen or the moment when two words would suddenly be one. This process is never-ending; in particular the creation of new words is intense in America, where the link-hyphen is speedily dispensed with. Space-suit and moon-walk probably dropped their hyphens in the second edition of the newspapers which reported them.

This poses another kind of problem for the printer, since many people will be uncertain whether a particular link-hyphen is necessary, and be inclined to leave it out. Nobody is likely to revive the hyphens in common words like newspaper and postman, which once were innovatory, but it is more difficult to judge modern words such as antitoxin, coaxial and coexistence.
Except where there is a grammatical reason, or where the meaning could be obscured, writers may consider it safe to leave the hyphen out, perhaps posing a problem for the computer.

Sir John Murray, editor of the massive OED (grandmother of every subsequent English dictionary), gave examples of meaning confirmed by the hyphen:

"a day well remembered" but "a well-remembered day"
"a sea of deep green" but "a deep-green sea"

Fowler's MODERN ENGLISH USAGE offers:

"an infallible wrinkle-remover"
"the ex-Tory Solicitor-General for Scotland" (i.e. the Solicitor-General who formerly was a Tory)
ne'er-do-well; stick-in-the-mud; what's-his-name

This is an open-ended subject since users of English feel free to innovate (so to speak) on the hoof. The link can be useful to avoid ambiguity: re-form (=form again) as against reform (improve), and re-signing a document as against resigning an employment. There are thousands of significant examples, some of them already challenged and proven in courts of law.

In earlier times the hyphen was often pressed into service to solve orthographic needs, producing some strange oddities in the process. In jury records of 1658 the Puritans' elaborate compound names are graphic descriptions of their personalities, e.g. Search-the-scriptures Morton might serve beside Strong-in-the-faith Jenkinson. Their names may have been inspired by colourful long names in the Bible, such as Maher-shalal-hash-baz (Isaiah VII i).

British aristocrats are still fond of their double and triple-barrelled names. The Lady Caroline Jemima (1858-1946) had a five-barrelled one: Temple-Nugent-Chandos-Brydges-Grenville. A modern daring explorer, much in the public eye, is Sir Ranulf Twisleton-Wykeham-Fiennes.

The most prestigious and valuable hyphen in business history was probably the one invented by a Mr Royce when he joined the company which had started out as Rolls & Co.
He suggested to his new partner that they should call the business Royce-Rolls, but in 1906 they eventually chose Rolls-Royce, which became the world's most famous trademark. In 1977 a play was put on in London's West End: Rolls Hyphen Royce. <<< Dave Fawthrop -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080327/10fabe8c/attachment-0001.htm From Bowerbird at aol.com Thu Mar 27 00:28:58 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 27 Mar 2008 03:28:58 EDT Subject: [gutvol-d] on the rejoining of end-line-hyphenates Message-ID: i wondered whether -- in responding to paul's prudishness -- i should have asked him whether he wanted to go on-topic and have a thoughtful discussion about end-line hyphenates, or not. i decided it was pretty clear that he didn't. especially because, when we were done with the discussion, it would have become abundantly clear to _everyone_ here just exactly how _far_ up his butt big_bill's head actually is... so i opted for the quick reply instead... but now that _dave_ has brought up the subject... ;+) -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080327/1eccdb6a/attachment.htm From Bowerbird at aol.com Thu Mar 27 08:43:14 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 27 Mar 2008 11:43:14 EDT Subject: [gutvol-d] on the rejoining of end-line-hyphenates Message-ID: oh, yeah, since i said i wasn't gonna give d.p. 
any "wiggle room" any more, i should have added this: the very _idea_ that rejoining end-line hyphenates and "clothing" end-line em-dashes could be done by the computer, rather than by human proofers, seems not to have even occurred to d.p. "leaders", let alone been _acted_upon_ by them, which seems to me to be an absolutely astonishing fact... but there it is... -bowerbird p.s. i was gonna say "even donned on d.p.", but googling that made me insecure about it, since "idea donned" got _97_ hits, while "idea dawned" got _9,280_. so either my memory of the word is severely flawed, or _lots_ of people are confused. 100-1 wrong is the biggest imbalance i ever saw! (although an idea "dawning" is a little bit poetic.) either way, i decided to go with a bland "occurred". ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080327/de4dac20/attachment.htm From vze3rknp at verizon.net Thu Mar 27 09:00:46 2008 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Thu, 27 Mar 2008 12:00:46 -0400 Subject: [gutvol-d] on the rejoining of end-line-hyphenates In-Reply-To: References: Message-ID: <47EBC4AE.6000209@verizon.net> Bowerbird at aol.com wrote: > the very _idea_ that rejoining end-line hyphenates > and "clothing" end-line em-dashes could be done > by the computer, rather than by human proofers, > seems not to have even occurred to d.p. "leaders", > let alone been _acted_upon_ by them, which seems > to me to be an absolutely astonishing fact... As it happens, most content providers at DP have been doing automatic hyphenation correction for years. Towards the end of 2002, Charles Aldarondo wrote a nice little perl script that made clever use of Finereader's dehyphenation capabilities. 
He and I were both using it, as well as several of the other major providers of content. When thundergnat wrote guiprep, one of the prime features that he included in it was a dehyphenation tool. It can be used either in the same way that Aldarondo did originally (comparing versions from Finereader where one had dehyphenation and one didn't) or in a mode where it looks for other, non-hyphenated examples of the word in the book. In either case, when it is sure, it just rejoins the hyphen. When it isn't, it leaves the hyphen in place. It's far from perfect and could use some serious revision, but at least it covers the most obvious cases. guiprep also "clothes" em-dashes automatically, which can lead to some, ah, interesting results when it comes to poetry.

We intentionally don't do dehyphenation on Beginner projects, so that they will learn what it is and how to do it. And perhaps it was not done in the test projects that bowerbird has been writing about. But it is certainly done for the majority of projects and has been for the last five years. If bowerbird had had a little more experience with proofreading at DP, he would certainly have observed this.

bowerbird is occasionally right about things, but in this case, he is totally off base.

JulietS

From Bowerbird at aol.com Thu Mar 27 10:04:18 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 27 Mar 2008 13:04:18 EDT
Subject: [gutvol-d] on the rejoining of end-line-hyphenates
Message-ID:

juliet said: > If bowerbird had had a little more experience with proofreading at DP

oh please, juliet. i can point to literally _hundreds_ of projects -- actually, probably thousands if i still had access over there -- which clearly and obviously have not been auto-dehyphenated. i'd wager that for every auto-dehyphenated file you can point to, i can point to _5_ others which were _not_ auto-dehyphenated... (and if the wager isn't too big, i'd say i'll make the margin 10-1.)
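[editor's note: the in-book-evidence mode Juliet describes (rejoin only when the rest of the same book attests the unhyphenated form, otherwise leave the hyphen) might be sketched like this. This is a guess at the logic, in Python rather than guiprep's Perl, not guiprep's actual implementation; `resolve_eol_hyphen` is a hypothetical name.]

```python
import re

def resolve_eol_hyphen(head, tail, book_text):
    """Decide whether an end-of-line 'head-' / 'tail' pair should be
    rejoined, using the book itself as evidence. Returns the resolved
    word, keeping the hyphen whenever the evidence is ambiguous."""
    words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", book_text.lower())
    solid = words.count((head + tail).lower())          # e.g. "bedroom"
    hyphened = words.count(f"{head}-{tail}".lower())    # e.g. "bed-room"
    if solid and not hyphened:
        return head + tail        # book consistently writes it solid
    if hyphened and not solid:
        return f"{head}-{tail}"   # book consistently hyphenates it
    return f"{head}-{tail}"       # no evidence or conflicting: leave it

book = "She left the bedroom. The bedroom door was shut."
print(resolve_eol_hyphen("bed", "room", book))
```

[unlike a fixed dictionary, this uses the book's own usage, which is also Gardner Buchanan's rule (1) later in the thread; when the book itself is inconsistent, the hyphen survives for a human to judge.]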
maybe it is true only "beginners" get such files, i wouldn't know. but it makes the least sense of all to have _beginners_ doing a job that is error-prone and accomplished much better by a computer. i guess maybe it's a part of the hazing process? (that's commentary, folks. d.p. has no official hazing process.)

in any case, it's clear that many of your content providers are ignorant of dehyphenation capabilities offered by your tools... even big_bill seems clueless, as evidenced by his statements... moreover, even the content providers who _do_ know about it don't seem to use the capability very much, as far as i can see...

but if it makes you feel better, i'll check this out on your tools, and i'm glad the idea has _occurred_ to you, even if you still don't seem to have _acted_upon_ it as well as you might have. (the caveat about poetry carries no weight with me, because i know a method that lets one avoid that problem, a method which provides additional benefits as well to the proofers...)

-bowerbird

************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080327/26d6d9d6/attachment.htm

From gbuchana at teksavvy.com Thu Mar 27 16:06:19 2008
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Thu, 27 Mar 2008 19:06:19 -0400
Subject: [gutvol-d] on the rejoining of end-line-hyphenates
In-Reply-To: <000001c88fd3$8d686680$a8393380$@co.uk>
References: <000001c88fd3$8d686680$a8393380$@co.uk>
Message-ID: <47EC286B.3050207@teksavvy.com>

Dave Fawthrop wrote:
> > If a word with a link hyphen appears at the end of a line
> the line is often broken after the link hyphen.
> Because link hyphens disappear in later and/or modern usage,
> it is often impossible to tell from an old text if a hyphen at
> the end of a line is a link hyphen or a hyphenated word,
> and so if it should be rejoined or not.

The rules I follow are:
(1) if there is another occurrence of this word in the same book, do what it does.
(2) if contemporary books or other books by the same author use this word, do what they do.
(3) use the modern convention.

I find more often that the book is not actually self-consistent than that I can't find a way to resolve a hyphen.

============================================================
Gardner Buchanan Ottawa, ON
FreeBSD: Where you want to go. Today.

From prosfilaes at gmail.com Thu Mar 27 17:28:30 2008
From: prosfilaes at gmail.com (David Starner)
Date: Thu, 27 Mar 2008 20:28:30 -0400
Subject: [gutvol-d] on the rejoining of end-line-hyphenates
In-Reply-To: <1206574153.16077.1244489543@webmail.messagingengine.com>
References: <1206574153.16077.1244489543@webmail.messagingengine.com>
Message-ID: <6d99d1fd0803271728h66d464blc5daaf9b659f0b6c@mail.gmail.com>

On Wed, Mar 26, 2008 at 7:29 PM, Paul Maas wrote:
> The "head up the ****" is uncalled for.

Then why did you forward it to all of us who have Bowerbird nice and killfiled?

From Bowerbird at aol.com Thu Mar 27 23:37:44 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 28 Mar 2008 02:37:44 EDT
Subject: [gutvol-d] parallel -- paul and the printing press -- 10
Message-ID:

wow. i'm already up to 10 posts in this parallel series. (a little word-joke there.) time to take stock...

this is about "paul and the printing press", the book being used for the experiment in _parallel_proofing_ by distributed proofreaders. to remind you, the project page for this book can be found here:
> http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea

and i've also put the scans for this book on my website for viewing:
>
http://z-m-l.com/go/paulp/paulpp032w.html

so what have we learned so far? well, not much, really, especially if you remember that -- at the outset -- we _already_knew_ that parallel proofing has a long glorious track record. so it's not as if we needed any "confirmation" of that from a d.p. experiment.

ironically however -- or maybe not, if you look at it from my perspective -- what we _did_ find is more solid evidence of the incompetence over at d.p.

so what have we learned about _that_, from this experiment?

0. d.p. needs to tighten the quality standards it judges acceptable.
1. one needs to start with good page-scans. re-scan if necessary.
2. one needs to use a _good_ o.c.r. program, like abbyy finereader.

now... where have i heard this before? because it sounds familiar... oh yeah, i remember, on my "10 points that d.p. needs to improve"...

> 1. ensure you have decent scans, and name them intelligently.
> 2. use a decent o.c.r. program, and ensure quality results.
> 3. do not tolerate bad text handling by content providers.
> 4. do a decent post-o.c.r. cleanup, before _any_ proofing.
> 5. retain linebreaks (don't rejoin hyphenates or clothe em-dashes).
> 6. change the ridiculous ellipse policy to something sensible.
> 7. stop doing small-cap markup with no semantic meaning.
> 8. i forget what 8 was for.
> 9. retain pagenumber information, in an unobtrusive manner.
> 10. format the ascii version using light markup, for auto-html.

wow, look at that... we have an exact match on number 1 and number 2. yes sir, this chain was being undermined by some very weak links right there at the very _start_ of this project, right at the _outset_...

***

bad scans caused hundreds and hundreds of unnecessary errors on this book, then bad o.c.r. made hundreds and hundreds more. when i say "bad scans", i mean that they were carelessly done, in a way that left a clear sign of incompetence on many of them, one which would cause problems with even a good o.c.r. app...
and when i say "bad o.c.r.", i mean the o.c.r. was done using _tesseract_, which is a "beta" o.c.r. app that works kinda funky. for instance, it lost all of the blank lines between paragraphs... there were errors on _every_single_page_ in this entire book... and many pages had an error in _almost_every_single_line_...

like "planet strappers", an incompetent content provider caused grief that cost volunteer proofers _lots_ of their time and energy, the time and energy they donate, in good faith, to a good cause...

so p1 was required to make what i'd estimate as _1,750_ changes, with a huge percentage of the changes being totally unnecessary, caused by incompetence that injected errors before any proofing. the content provider could've redone their incompetent work in much less time than was spent by proofers fixing their mistake... when you treat these people like guinea piggies, they will leave, and never come back again. is that really what you want to do?

even so, the p1 proofers were miracle workers, and transformed 76 of the pages to perfection, with no further changes entered... moreover, even on the pages they "failed" to take _all_the_way_ to perfection, they got 'em very close. p2 only had to correct a small percentage of the 6500+ lines in this 200+page book, yet their 122 corrections brought another _116_ pages to perfection. (and, like p1, even the pages that weren't perfect were improved.)

so p1 and p2 combined to take _84%_ of the pages to perfection! (for the record, this suggests overall d.p. accuracy is right at 60%.)

this left p3 to finish _36_ pages, where they had to make changes to a tiny number of lines (39) to get us to (an assumed) perfection.

altogether, p1 had just 161 lines later changed by p2 and/or p3...
these lines are listed for your viewing pleasure on this web-page: > http://z-m-l.com/go/paulp/paul-p1-p3-161changes.html again, considering how rotten the scans were, the fact that p1 took _all_but_161_ of 6500+ lines to _assumed_perfection_ is amazing! (and as minuscule as that number is, it still fails to describe the awesomeness of the performance of p1, as i will discuss later.) so once again, we get the pattern i've discussed all along, the pattern that seems to capture a "common-sense" take, which is that p1 fixes most of the errors, p2 gets most of the remaining ones, and p3 comes in and does clean-up. again, this is the pattern you get on page after page, in book after book, day after day, over in d.p.-land... *** i said "assumed perfection" because we _defined_ the pages as "perfect" after p3, for the expedience of evaluating quality, but even now, though, our suspicion is that _some_ errors remain... indeed, p3 _did_ leave errors, which we can pin-point due to that parallel round of p1 proofing. yes sir, p1 proofers who did the second parallel found errors the p3 "marines" missed! but hey, by now, that shouldn't be surprising to you. of course, p3 had found errors which parallel#2 had missed, so nobody can make a clear claim of superiority based on this data. then again, p1 never claimed that it was _better_ than p3, did it? certainly never for _parallel#2_. why, that would be _heretical_! *** once before, i've mentioned these errors p3 missed. what are they? stick around for the next messages in this series, when i reveal them. *** so, after dealing with the incompetence of the content provider and demonstrating the kick-ass quality shown by the normal p1 round, we're finally able to address the assessment of the parallel proofing. finally... as i said at the outset, parallel proofing works, and we know it works. it's already proven itself over and over again, so who needs a "test"?... 
well, sure enough, it proved it again here, where 2 parallel rounds of p1 proofing produced results as good as the normal p1-p2-p3. the parallel proofers missed a few things the serial proofers found, and vice versa, but their overall performance was chillingly similar...

so, just as with the previous experiment using "planet strappers", the results fail to support a contention that p3 are better proofers. the parallel round of p2 matched up, and it matched up very well.

also extremely spooky was the similarity of the parallel proofings... both of them had to make an estimated 1,750 changes to the text, but when i analyzed their real differences, there were under 100...

and all of this recalls the eerie findings on "planet strappers", where results were so identical that they were positively freaky. i joke when i call d.p. a "cult", but i'm wondering if there _is_ something in the water over there, because this is _strange_.

***

one good question would've been whether it's _cost-effective_ to make two groups of proofers find and fix the exact same errors, which is what parallel proofing forces people to do, unfortunately. regrettably, there was little in the design of this "experiment" which would help us to _answer_ that more-interesting question, however.

having closely examined the sad o.c.r. produced by these bad scans, though, i can safely say that it was _not_ cost-effective on this book... just on its face, we _know_ it's a waste of their time and energy to have proofers correct an estimated 1,750 errors a second time, just so they will catch a half-dozen errors which were missed the first time around. that's a no-brainer.

maybe on a clean book, parallel proofing would be cost-effective. but on a dirty book like this one, it's clear that it's a bad decision...

***

however, since that time and energy was already wasted on this book, let us rejoice in the fact that we have now caught those 6 new errors...
with less than 100 real differences between the two parallel proofings, perhaps less than 50, it was not difficult to do the resolution of them... and now we have a book we can justifiably feel is remarkably clean...

***

so, what do we still need to do to finish the analyses of this book?

***

first, i need to show you those errors that p3 missed, as well as get some feedback on some other possible errors that turned up.

***

i'll also be comparing tesseract's output with o.c.r. from finereader, so i can _quantify_ exactly how many "excess" errors tesseract had. this web-page shows about 1771 clear differences between them:
> http://z-m-l.com/go/paulp/1771tess_v_abbyy.html

here's a very colorful _merge_ of the tesseract and abbyy output, with their identical lines in black, and the differing lines in color:
> http://z-m-l.com/go/paulp/abbyytessmerge02.html

i'll do more work on resolving these 2 sets of o.c.r. but since they are both besieged by problems from the bad pages, it's pointless to try to do much with that resolved data, as it'll always be flawed. but if piggy were to re-scan the bad pages, that would prove useful. i'll also compare the _good_ abbyy o.c.r. with our refined output, so we know exactly how close we could've gotten with good o.c.r.

***

finally, i'm gonna take a look at the 161 lines p1 "failed" to perfect. you might (or might not) have noticed that was _new_ information which i just dropped into this "summary" in a fairly quiet fashion... nonetheless, i think it's an _extremely_important_ fact to process...

p1 made an estimated 1,750 changes, including some that required the removal of garbage characters from the left margin, and then a keying in of totally absent words, yet by the time that p1 was done, p2 and p3 were left with a mere 161 lines (out of 6500+) to correct.

that's a huge drop, from 1,750 changes (p1) to 122 (p2) to 39 (p3).
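[editor's note: taking the change counts reported in this post at face value (they are the poster's estimates, not independently verified), the per-round drop is easy to quantify:]

```python
# Counts reported in the post above: estimated changes per round,
# total lines in the book, and lines p1 left for later rounds.
p1_changes, p2_changes, p3_changes = 1750, 122, 39
total_lines, lines_left_after_p1 = 6500, 161

drop_p1_p2 = 100 * (1 - p2_changes / p1_changes)
drop_p2_p3 = 100 * (1 - p3_changes / p2_changes)
pct_left = 100 * lines_left_after_p1 / total_lines

print(f"p1 -> p2: {drop_p1_p2:.0f}% fewer changes")            # ~93%
print(f"p2 -> p3: {drop_p2_p3:.0f}% fewer changes")            # ~68%
print(f"lines still needing fixes after p1: {pct_left:.1f}%")  # ~2.5%
```

[so on these numbers roughly 97.5% of lines were already final after a single round, which is the fact the rest of the post leans on.]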
_especially_ when you consider the huge p2 and p3 backlogs at d.p., the idea that the p1 people can take a book that close to perfection, only to have it then sit for months or years in a queue, is... i dunno, take your pick of "irritating", or "sad", or "troubling", or "curious", or insert your own word here to describe your reaction to that situation.

and then multiply it by 4 when i tell you that the number of errors might have gone as low as 40, if a simple clean-up tool was used. (don't quote me yet on that number; wait until i prove it to you...)

***

ok, so i'm done taking stock. but maybe you're sitting there with an empty feeling inside, and maybe you don't know exactly why... i can tell you why. it's because this experiment was supposedly geared to answering the question about "confidence in page"... that is, how can we know that a page is "done" being proofed? so what do we have in the form of an answer to that question? well... not much...

that's because -- as some of its own people have pointed out -- d.p. doesn't really _do_ its experiments in the "scientific" mold; you know, where you frame hypotheses and develop a means of collecting data from randomly-assigned conditions that will provide evidence that disconfirms the hypotheses you're testing. d.p. experiments are more like "let's try it and see what happens". which is fine, i guess. everybody doesn't have to be a scientist... but when you're analyzing the data, that can be underwhelming.

nonetheless, i did some elaboration that explains to the d.p. people -- if they are willing to listen, always a dubious assumption here -- exactly how they might go about finding an answer to that question, using the data that is already under their noses on every d.p. project. specifically, i analyzed the "progression types" i found in this book:

-> p1-p2-p3 -- 22 pages -- (every round made a change)
-> p1-**-p3 -- 14 pages -- (p1 and p3 made a change)
-> **-p2-p3 -- 1 page -- (p2 and p3 made a change)
->
p1-p2-** -- 116 pages -- (p2 made the last change to the page) ->?? p1-**-** -- 76 pages -- (p1 made the last change to the page) ->?? **-**-** -- 15 pages -- (all these no-change pages were _blank_) we'd like to believe a "no diff" means the page is perfect, or at least that the probability is very high that it's perfect, so the most puzzling pages were those 14 pages where p2 had a "no diff" on the page, but p3 did make a change. yet on those 14 pages were just _2_ actual, troubling errors. (and the parallel p1 proofers found and fixed both of them.) so i would suggest once a "no diff" round has been obtained, you could reasonably assume that the page is "clean enough". had that been done in this book, you would've saved the work of the additional round of proofing on 90 pages (76 plus 14), at the cost of missing 2 errors. that sounds reasonable to me. this rule-of-thumb would also advise that 22 pages (p1-p2-p3) should be subjected to one more round (at least) for a "no diff". i know if _i_ had to guess which pages might still harbor errors, i would guess those 22 pages, since they haven't been "verified". and, just to be clear, i am _not_ advocating that _one_ "no diff" should be used as the cutoff. i usually suggest _two_ of them. it might be overkill, but i would still use _two_ to start out with, especially since all the rounds would be done by p1 proofers, who are quite plentiful. and using p1 proofers to push books _all_ the way to perfection, not all-but-161-of-6,500-lines, sounds like a very intelligent use of resources, it surely does... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15& ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080328/1f28166b/attachment-0001.htm
From Bowerbird at aol.com Fri Mar 28 11:49:41 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 28 Mar 2008 14:49:41 EDT
Subject: [gutvol-d] parallel -- paul and the printing press -- 11
Message-ID:

in this post, on the parallel test of "paul and the printing press",
we learn what a 4th round of proofing, after a normal p1-p2-p3, buys us.

the (ordinary and lowly) p1 proofers who did this round found
6 errors that had not been found previously. now of course they
also missed some of the errors found in the normal workflow,
so it's not as if they were better proofers. they were just _different_.

you'll observe that 2 of these errors -- the second and the sixth --
could have been autodetected. but the others were pretty sneaky,
including a missing comma, a missing word, and 2 stealth scannos.
yet the p3 "marines" missed them, while a parallel p1 caught them.
so much for the notion of "elusive errors only experts can catch"...

bottom-line, though, once again, in yet another d.p. experiment,
the final results are crystal-clear: the hierarchy of proofers over at
distributed proofreaders does _not_ do what it was intended to do.
and the huge backlogs that it _has_ caused could've been avoided,
not to mention all the hard feelings that it has produced in people.

***

here are _6_ errors found by parallel#2, but not p1-p2-p3:

p3> "Why to print our life histories and obituaries
pp> "Why, to print our life histories and obituaries

p3> one passed through the school corridors, and `
pp> one passed through the school corridors, and

p3> "But there are short outs," argued Mr. Cameron.
pp> "But there are short cuts," argued Mr. Cameron.
p3> cast, the sections of stereotype were put
pp> cast, the half sections of stereotype were put

p3> fine articles from patents and distant
pp> fine articles from parents and distant

p3> When the acounts were found to be short,
pp> When the accounts were found to be short,

and here is one more, not found by p1-p2-p3 or pp:

p3> "'Thanks be to God, Hallelujah!'
pp> "'Thanks be to God, Hallelujah!'
me> "'Thanks be to God, Hallelujah!'"

i won't get into a "debate" about whether these are "errors",
but here are some p-book words _i_ felt should be "fixed", even
if _some_ people out there might have left them as is:

> skilful to skillful p#73 and p#92
> marvellous to marvelous p#93 (p-book inconsistency)
> sceptical to skeptical p#130
> smooths to smoothes p#182
> signalled to signaled p#190

if you _do_ consider these as "errors", then p3 left 12 total.
whether you want to call it 6 or 12, it's clear p3 ain't perfect,
so those people who are looking for "perfection" from d.p.
have a problem on their hands, in that even _three_ rounds
of proofing is not delivering it in the present circumstances.

***

in addition to outright errors, we have a lot of _questions_...

here's a non-compound-word that wasn't asterisk-noted:
> manuscripts, and many a one is marred by mis-spelling

i was unsure about these two compound words:
> Paul had had time to become really down-hearted,
> "An honest blunder is one thing; but pre-meditated

here are others (maybe errors, maybe not) i _did_ change:
> scarfpins to scarf-pins p#65
> under-classman to underclassman p#187

i also changed a bunch of "some one" to "someone",
just 'cause it looked better to me. i hope that's right. ;+)

oh yeah, and i fixed "to-day", "to-morrow", "to-night", etc.

i also eliminated those silly-looking characters from words like
alumnae, caesar, naively, papier-mache, resume, role, and so on,
just because it makes the europeans so mad when you do that...
(and sorry, albrecht durer, but i had to do that to your name too.)
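[editor's note: the diacritic-stripping described above ("eliminated those silly-looking characters") can be sketched in a few lines of Python with the standard unicodedata module. this is an illustrative reconstruction, not the poster's actual tool; the function name and examples are the editor's own.]

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Fold accented characters to plain ASCII, e.g. 'resume' with accents -> 'resume'.

    NFKD decomposition splits each accented letter into a base letter
    plus combining marks; dropping the marks leaves the ASCII base.
    Ligatures such as 'æ' have no decomposition, so map them explicitly.
    """
    text = text.replace("æ", "ae").replace("Æ", "Ae")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_diacritics("naïvely"))        # naively
print(strip_diacritics("papier-mâché"))   # papier-mache
print(strip_diacritics("alumnæ"))         # alumnae
```

whether one *should* do this is exactly the disagreement aired above; the sketch only shows that the transformation itself is mechanical.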
i wasn't sure about this one, so i left it as it was:
> silician queen p#44

and here's another one that has me thoroughly confused:
> elaborate productions of a printing age, ecclesiasties p#92

finally, here's a funny word, to amuse you:
> spondulics p#9

***

so -- all in all -- we've got around _two_dozen_ instances of
questionable items there, which is about par for the course
on a 200+page book, i'd say. so that needs to be taken into
_account_ when we wanna talk about "achieving perfection".

any time you're dealing with language, it ain't cut-and-dried.
there are a lot of decisions in any book that can go either way.
(of course, any d.p. people who've post-processed know this.)

so it's all well and good to talk about "removing all the errors",
but we need to realize that at some point, that devolves into a
never-ending conversation about what _constitutes_ an error.

but, by the way, it's extremely easy for me to tell you _what_
constitutes an error in _my_ book -- if, when you bring it to
my attention, i _change_ it, then you have found "an error"...
if i don't change it, then you have not found an error.

of course, you are free to differ with my opinion. who cares?

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080328/d5272268/attachment.htm
From Bowerbird at aol.com Fri Mar 28 16:22:26 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 28 Mar 2008 19:22:26 EDT
Subject: [gutvol-d] parallel -- paul and the printing press -- 12
Message-ID:

so, were you surprised to learn that the normal p1 proofers in
the parallel proofing test of "paul and the printing press" took
6,360+ lines to perfection, with only _161_ not perfect?
that's over a 97.5% accuracy-rating on _lines_ (not words or
characters, which are the typical units of measure for that)...

but if you are surprised by that high accuracy, you haven't
been paying attention, because p1 regularly does that well.

in fact, sometimes one gets accuracy that good out of o.c.r.!
i did a full-on analysis you can find in the d.p. forums --
search for "a revolutionary proofing methodology" -- where
i found the o.c.r. from the open content alliance got just
_57_ lines incorrect in a book with 8,000+ lines! (and the
google o.c.r. on a different physical copy of a slightly
different version had all but 300 lines correct.)

face it, because of the bad scans and the use of tesseract,
on a lot of the lines in this book, the p1 proofers acted as
"the human o.c.r. program". and they had great accuracy!

again, p1 made 1,750 changes, p2 just 122, p3 just 39...
and with decent scans and decent o.c.r., the results might
have been p1 with 300 changes, p2 with 30, p3 with 3...

***

and even though a mere 161 imperfect lines out of 6,500+
is an amazingly high rate of quality, closer analysis of those
161 lines suggests that many could have been autodetected,
meaning they should've been fixed during _preprocessing_,
before they were ever even presented to volunteers to proof.

it also suggests that p1's "failure" to take these 161 lines to
a state of perfection can be ignored, since _postprocessing_
would easily find and fix the errors the proofers left behind.

but whichever it was -- preprocessing or postprocessing --
it's clear _many_ of the bad 161 lines could've been caught.

how many? well, by my count, all but _37_ could've been
autodetected. (my earlier guess of _40_ ended up being
pretty accurate.)
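[editor's note: per-round change counts like the p1/p2/p3 figures above can be computed by diffing successive round outputs line by line. a minimal sketch using Python's difflib follows; the function name and sample lines (taken from the error listing in this thread) are the editor's own, not d.p. tooling.]

```python
import difflib

def changed_lines(before: list[str], after: list[str]) -> int:
    """Count lines touched between two rounds of a text, using
    SequenceMatcher opcodes (replace/delete/insert spans)."""
    sm = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    changed = 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            # count whichever side of the changed span is larger
            changed += max(i2 - i1, j2 - j1)
    return changed

round_n = ["To paralyze the Caesars -- and to stike",
           "the Fire-eater! Have a copy of the Jabbermock!",
           "firm of George L. Kimball and from Dalrymple"]
round_n1 = ["To paralyze the Caesars -- and to strike",
            "the Fire-eater! Have a copy of the Jabberwock!",
            "firm of George L. Kimball and from Dalrymple"]

n = changed_lines(round_n, round_n1)
accuracy = 100.0 * (len(round_n) - n) / len(round_n)
print(n, f"{accuracy:.1f}%")  # 2 33.3%
```

the same count over a whole book's page texts gives exactly the "p1 made 1,750 changes" style of figure quoted above.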
i've appended my categorization of 'em, and also put it here:
> http://z-m-l.com/go/paulp/paul-161-categorize.html

these difficult-to-detect 37 break down like this:
-> stealth scannos, 15
-> missing words, 13
-> punctuation problems, 9

some punctuation problems are easy to autodetect, such as a
sentence-terminating period not followed by a capital letter.
other punctuation problems are almost impossible to detect,
such as the speck-induced phantom comma where a real one
would not be totally inappropriate, according to the content...

missing words are also extremely hard to detect automatically.
stealth scannos, of course, are the prototype of hard-to-detect.

but still, the fact that autodetection could have fixed _many_
of the lines on which p1 "failed", so that a mere _37_ are left
which are not perfect -- out of 6,500+ lines in this book --
indicates unequivocally that p1 is doing some kick-ass work.
these p1 proofers deserve far more acclaim than they receive.

i'm not done yet, but i'll give you a break over the weekend... ;+)

-bowerbird

> http://z-m-l.com/go/paulp/paul-161-categorize.html

for the d.p. parallel-proofing experiment with "paul and the
printing press", analysis showed that after p1, only 161 lines
were later changed by p2 and p3. of these 161 imperfect lines,
124 could easily be autodetected, and 37 could not. this
indicates the p1 proofers could have taken this book even
closer to perfection, with a mere _37_ lines being incorrect
after a combination of p1 and autodetection...
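[editor's note: the "easy to autodetect" classes named above -- a sentence-terminating period not followed by a capital, unbalanced quotemarks, impossible punctuation clusters -- can be sketched as simple pattern tests. a hypothetical illustration (function name and patterns are the editor's, not d.p.'s actual checks); it flags suspects for a human, rather than fixing anything, since abbreviations like "Mr." or "etc." can false-positive.]

```python
import re

def autodetect_suspects(line: str) -> list[str]:
    """Flag a line for the easy-to-autodetect error classes."""
    flags = []
    # unbalanced double quotes, e.g. a dropped opening quote on dialogue
    if line.count('"') % 2 == 1:
        flags.append("unbalanced-quotes")
    # sentence terminator followed by a lowercase letter, e.g. "Kipper. we'll"
    if re.search(r'[.!?] [a-z]', line):
        flags.append("lowercase-after-terminator")
    # impossible punctuation clusters, e.g. '.:' as in 'room.:"'
    if re.search(r'[.!?,;:][,;:]', line):
        flags.append("punctuation-cluster")
    return flags

for line in ["Kipper. we'll see what we can do toward",
             'them would fill a room.:"',
             'The better way to go at such an undertaking,"']:
    print(line, "->", autodetect_suspects(line))
```

run over the 161 imperfect lines, checks of roughly this shape are what would catch the 124 "autodetectable" cases tallied in the categorization.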
------------------------------------------------------------------ of the 161 imperfect lines left by p1, 124 could be autodetected for easy fixing: 029 -- spellcheck -- easy to autodetect -- n=29 019 -- quotemarks, unbalanced or inappropriate -- easy to autodetect -- n=19 026 -- letter-casing -- easy to autodetect -- n=26 010 -- dehyphenation -- should be done automatically -- n=10 005 -- diacritic nonsense -- we don't need no high-bit characters -- n=5 002 -- preprocessing changes that should be standard policy -- n=2 015 -- hyphenation and em-dash -- don't count against proofers -- n=15 018 -- punctuation impossibilities -- can be autodetected -- n=18 ----- 124 lines that could be easily autodetected. of the 161 imperfect lines left by p1, 37 would be difficult to autodetect: 015 -- stealth scannos -- hard to detect -- n=15 013 -- missing/excess words -- hard to detect -- n=13 009 -- punctuation errors that are not impossibilities -- hard to detect -- n=9 ---- 037 lines that could _not_ be easily autodetected. ------------------------------------------------------------------ spellcheck -- easy to autodetect -- n=29 002 -- 008.png -- =p1=> To paralyze the Caesars -- and to stike 002 -- 008.png -- =p3=> To paralyze the Caesars -- and to strike 002 -- 008.png -- diff> ====================================^^^ 009 -- 025.png -- =p1=> the Fire-eater! Have a copy of the Jabbermock! 009 -- 025.png -- =p3=> the Fire-eater! Have a copy of the Jabberwock! 009 -- 025.png -- diff> =========================================^==== 011 -- 026.png -- =p1=> "The March Hare!" he repeated wlth enthusiasm. 011 -- 026.png -- =p3=> "The March Hare!" he repeated with enthusiasm. 011 -- 026.png -- diff> ===============================^============== 017 -- 042.png -- =p1=> firm of George L. Kirnball and from Dalrymple 017 -- 042.png -- =p3=> firm of George L. 
Kimball and from Dalrymple 017 -- 042.png -- diff> ====================^^^^=^^^^^^^^^^^^^^^^^^^^ 025 -- 050.png -- =p1=> "Thus, you see, was the eopyist forced to 025 -- 050.png -- =p3=> "Thus, you see, was the copyist forced to 025 -- 050.png -- diff> ========================^================ 037 -- 068.png -- =p1=> "Yes, and not only were the first manuseripts 037 -- 068.png -- =p3=> "Yes, and not only were the first manuscripts 037 -- 068.png -- diff> =======================================^===== 043 -- 079.png -- =p1=> else in the paper. Sorne thought more 043 -- 079.png -- =p3=> else in the paper. Some thought more 043 -- 079.png -- diff> =====================^^^^^^^^^^^^^^^^ 052 -- 084.png -- =p1=> impulse is a very seliish one," said his father. 052 -- 084.png -- =p3=> impulse is a very selfish one," said his father. 052 -- 084.png -- diff> =====================^========================== 053 -- 089.png -- =p1=> and Diamonds for the more prosperous ` 053 -- 089.png -- =p3=> and Diamonds for the more prosperous 053 -- 089.png -- diff> ====================================^^ 044 -- 080.png -- =p1=> smoothed away his objectious until, upon a 044 -- 080.png -- =p3=> smoothed away his objections until, upon a 044 -- 080.png -- diff> ==========================^=============== 045 -- 080.png -- =p1=> body of workers hnally stood shoulder to shoulder, 045 -- 080.png -- =p3=> body of workers finally stood shoulder to shoulder, 045 -- 080.png -- diff> ================^^^^=^^^^^=^^^^^^^^^^^^^^^^^^^^^^^ 046 -- 080.png -- =p1=> finer and more efiicient. It was, as Paul 046 -- 080.png -- =p3=> finer and more efficient. 
It was, as Paul 046 -- 080.png -- diff> =================^======================= 048 -- 081.png -- =p1=> Into Pau1's editorial sanctum articles from 048 -- 081.png -- =p3=> Into Paul's editorial sanctum articles from 048 -- 081.png -- diff> ========^================================== 049 -- 082.png -- =p1=> various sources one number after another of ` 049 -- 082.png -- =p3=> various sources one number after another of 049 -- 082.png -- diff> ===========================================^^ 050 -- 084.png -- =p1=> like to write up fires and aceidents and wear a 050 -- 084.png -- =p3=> like to write up fires and accidents and wear a 050 -- 084.png -- diff> =============================^================= 084 -- 156.png -- =p1=> Paul. Page 13T. 084 -- 156.png -- =p3=> Paul. Page 137. 084 -- 156.png -- diff> =============^= 096 -- 179.png -- =p1=> visit to a big newspaper offfice Saturday evening 096 -- 179.png -- =p3=> visit to a big newspaper office Saturday evening 096 -- 179.png -- diff> ============================^^^^^^^^^^^^^^^^^^^^^ 097 -- 181.png -- =p1=> you must remember that it was especially diffcult 097 -- 181.png -- =p3=> you must remember that it was especially difficult 097 -- 181.png -- diff> =============================================^^^^ 098 -- 182.png -- =p1=> "So, son," concluded Mr. wright, "you've 098 -- 182.png -- =p3=> "So, son," concluded Mr. Wright, "you've 098 -- 182.png -- diff> =========================^============== 099 -- 182.png -- =p1=> approve of the fity-dollar bill which at that 099 -- 182.png -- =p3=> approve of the fifty-dollar bill which at that 099 -- 182.png -- diff> =================^^^^^^=^^^^^^=^^^^^^^^^^^^^^ 110 -- 193.png -- =p1=> process and know how the brst printing 110 -- 193.png -- =p3=> process and know how the first printing 110 -- 193.png -- diff> =========================^^^^^^^^^^^^^ 112 -- 196.png -- =p1=> of each shelf classined and marked." 112 -- 196.png -- =p3=> of each shelf classified and marked." 
112 -- 196.png -- diff> ====================^^^^^^^^^^^^^^^^ 113 -- 200.png -- =p1=> They had now reached the lowest Hoor and 113 -- 200.png -- =p3=> They had now reached the lowest floor and 113 -- 200.png -- diff> ================================^^=^^^^^ 117 -- 204.png -- =p1=> little chap over there by the bre hangs our 117 -- 204.png -- =p3=> little chap over there by the fire hangs our 117 -- 204.png -- diff> ==============================^^^^^^^^^^^^^ 131 -- 214.png -- =p1=> Deeker, rolling his eyes up to the ceiling with 131 -- 214.png -- =p3=> Decker, rolling his eyes up to the ceiling with 131 -- 214.png -- diff> ==^============================================ 142 -- 226.png -- =p1=> wont, in unselhsh fashion, to let every one else 142 -- 226.png -- =p3=> wont, in unselfish fashion, to let every one else 142 -- 226.png -- diff> ==============^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 151 -- 233.png -- =p1=> delivered was clicked offon Mr. Carter's typewriter 151 -- 233.png -- =p3=> delivered was clicked off on Mr. Carter's typewriter 151 -- 233.png -- diff> =========================^^^^^^^^^^^^^^^^^^^^^^^^^^ 155 -- 236.png -- =p1=> Carneron was a big enough man to be forgiving. 155 -- 236.png -- =p3=> Cameron was a big enough man to be forgiving. 155 -- 236.png -- diff> ==^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 158 -- 237.png -- =p1=> and joy to the crowning event of l920's 158 -- 237.png -- =p3=> and joy to the crowning event of 1920's 158 -- 237.png -- diff> =================================^===== quotemarks, unbalanced or inappropriate -- easy to autodetect -- n=19 005 -- 018.png -- =p1=> The better way to go at such an undertaking," 005 -- 018.png -- =p3=> "The better way to go at such an undertaking," 005 -- 018.png -- diff> ^^^^^^^=^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 007 -- 023.png -- =p1=> asserted at length. " But the ducats -- where 007 -- 023.png -- =p3=> asserted at length. 
"But the ducats -- where 007 -- 023.png -- diff> =====================^^^^^^^^^^^^^^^^=^^^^^^^ 013 -- 033.png -- =p1=> back a step or two. " I couldn't, Kip. Don't 013 -- 033.png -- =p3=> back a step or two. "I couldn't, Kip. Don't 013 -- 033.png -- diff> =====================^^^^^^^^^^^^^^^^^^^^^^^ 015 -- 036.png -- =p1=> So you're Paul Cameron. I've had dealings 015 -- 036.png -- =p3=> "So you're Paul Cameron. I've had dealings 015 -- 036.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 018 -- 043.png -- =p1=> the Echo?"' 018 -- 043.png -- =p3=> the Echo?" 018 -- 043.png -- diff> ==========^ 019 -- 045.png -- =p1=> "Oh, it's not that," said Paul quickly. " We 019 -- 045.png -- =p3=> "Oh, it's not that," said Paul quickly. "We 019 -- 045.png -- diff> =========================================^^^ 020 -- 046.png -- =p1=> People didn't always use to have paper, 020 -- 046.png -- =p3=> "People didn't always use to have paper, 020 -- 046.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 024 -- 049.png -- =p1=> "Thanks be to God, Hallelujah!' 024 -- 049.png -- =p3=> "'Thanks be to God, Hallelujah!' 024 -- 049.png -- diff> =^^^^^^^^^^^^^^^^^^^^^=^^^^^^^^ 029 -- 053.png -- =p1=> Paul waited an instant, then added dryly: " In 029 -- 053.png -- =p3=> Paul waited an instant, then added dryly: "In 029 -- 053.png -- diff> ===========================================^^^ 058 -- 094.png -- =p1=> in years!" ejaculated the postmaster. " Seems 058 -- 094.png -- =p3=> in years!" ejaculated the postmaster. "Seems 058 -- 094.png -- diff> =======================================^^=^^^ 105 -- 189.png -- =p1=> surface.' 105 -- 189.png -- =p3=> surface." 105 -- 189.png -- diff> ========^ 119 -- 204.png -- =p1=> we ought to pay more for our newspapers.' 119 -- 204.png -- =p3=> we ought to pay more for our newspapers." 
119 -- 204.png -- diff> ========================================^ 130 -- 213.png -- =p1=> he heard himself saying, " I'd call it a beastly 130 -- 213.png -- =p3=> he heard himself saying, "I'd call it a beastly 130 -- 213.png -- diff> ==========================^^^^^^^=^^^^^^^^^^^^^^ 132 -- 214.png -- =p1=> "Nothing! 'Cut it out, that's all." 132 -- 214.png -- =p3=> "Nothing! Cut it out, that's all." 132 -- 214.png -- diff> ==========^^^^^^^^^^^^^^^^^^^^^=^^^ 133 -- 215.png -- =p1=> with the boy?' 133 -- 215.png -- =p3=> with the boy? 133 -- 215.png -- diff> =============^ 139 -- 222.png -- =p1=> Mr. Carter -- " you were just right, son. The 139 -- 222.png -- =p3=> Mr. Carter -- "you were just right, son. The 139 -- 222.png -- diff> ===============^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 140 -- 224.png -- =p1=> "How are you, old man,' Paul called jubilantly. 140 -- 224.png -- =p3=> "How are you, old man," Paul called jubilantly. 140 -- 224.png -- diff> ======================^======================== 149 -- 231.png -- =p1=> Paul. " But it's all right now. The 149 -- 231.png -- =p3=> Paul. "But it's all right now. The 149 -- 231.png -- diff> =======^^^^^^^^^^^=^^^^^^^^^^^^^^^^ 157 -- 236.png -- =p1=> Cameron.' 157 -- 236.png -- =p3=> Cameron." 157 -- 236.png -- diff> ========^ letter-casing -- easy to autodetect -- n=26 006 -- 022.png -- =p1=> "Say, Cart, what do you think of '20 Starting 006 -- 022.png -- =p3=> "Say, Cart, what do you think of '20 starting 006 -- 022.png -- diff> =====================================^======= 012 -- 028.png -- =p1=> Kipper. we'll see what we can do toward 012 -- 028.png -- =p3=> Kipper. 
We'll see what we can do toward 012 -- 028.png -- diff> ========^============================== 022 -- 049.png -- =p1=> the patient Workers were so glad when their 022 -- 049.png -- =p3=> the patient workers were so glad when their 022 -- 049.png -- diff> ============^============================== 027 -- 052.png -- =p1=> the great objection to this method was that several 027 -- 052.png -- =p3=> The great objection to this method was that several 027 -- 052.png -- diff> ^================================================== 041 -- 079.png -- =p1=> Was quite an eye opener! A paper for general 041 -- 079.png -- =p3=> was quite an eye opener! A paper for general 041 -- 079.png -- diff> ^=========================================== 042 -- 079.png -- =p1=> Burmingham. There Was actually something 042 -- 079.png -- =p3=> Burmingham. There was actually something 042 -- 079.png -- diff> ==================^===================== 054 -- 090.png -- =p1=> was one of the later and most skilful Woodcut 054 -- 090.png -- =p3=> was one of the later and most skilful woodcut 054 -- 090.png -- diff> ======================================^====== 066 -- 131.png -- =p1=> what was to be done? 066 -- 131.png -- =p3=> What was to be done? 066 -- 131.png -- diff> ^=================== 067 -- 132.png -- =p1=> I can't understand it. we haven't branched 067 -- 132.png -- =p3=> I can't understand it. We haven't branched 067 -- 132.png -- diff> =======================^================== 061 -- 111.png -- =p1=> or enamel. As time Went on and the religious 061 -- 111.png -- =p3=> or enamel. As time went on and the religious 061 -- 111.png -- diff> ===================^======================== 063 -- 117.png -- =p1=> cultured nation. By no means. what I mean 063 -- 117.png -- =p3=> cultured nation. By no means. 
What I mean 063 -- 117.png -- diff> ==============================^========== 065 -- 120.png -- =p1=> "Typewriters Come at all prices," his father 065 -- 120.png -- =p3=> "Typewriters come at all prices," his father 065 -- 120.png -- diff> =============^============================== 069 -- 134.png -- =p1=> "Something's fussing you. what is it?" 069 -- 134.png -- =p3=> "Something's fussing you. What is it?" 069 -- 134.png -- diff> ==========================^=========== 070 -- 135.png -- =p1=> Bond" was converted into cash; Paul'S typewriter 070 -- 135.png -- =p3=> Bond" was converted into cash; Paul's typewriter 070 -- 135.png -- diff> ====================================^=========== 088 -- 161.png -- =p1=> the machine's myriad advantages. wasn't it 088 -- 161.png -- =p3=> the machine's myriad advantages. Wasn't it 088 -- 161.png -- diff> =================================^======== 089 -- 162.png -- =p1=> March Hare Would branch out and be made 089 -- 162.png -- =p3=> March Hare would branch out and be made 089 -- 162.png -- diff> ===========^=========================== 092 -- 168.png -- =p1=> largest industries. we cannot do without 092 -- 168.png -- =p3=> largest industries. We cannot do without 092 -- 168.png -- diff> ====================^=================== 093 -- 173.png -- =p1=> school, and all the Web of circumstances in 093 -- 173.png -- =p3=> school, and all the web of circumstances in 093 -- 173.png -- diff> ====================^====================== 095 -- 177.png -- =p1=> a press Was built up Which is so intricate and 095 -- 177.png -- =p3=> a press was built up which is so intricate and 095 -- 177.png -- diff> ========^============^======================== 104 -- 188.png -- =p1=> have the main idea and When I see the thing in 104 -- 188.png -- =p3=> have the main idea and when I see the thing in 104 -- 188.png -- diff> =======================^====================== 107 -- 190.png -- =p1=> "I See" 107 -- 190.png -- =p3=> "I see." 
107 -- 190.png -- diff> ===^==^ 111 -- 194.png -- =p1=> "I See." 111 -- 194.png -- =p3=> "I see." 111 -- 194.png -- diff> ===^==== 115 -- 201.png -- =p1=> during the war," Stammered Paul. 115 -- 201.png -- =p3=> during the war," stammered Paul. 115 -- 201.png -- diff> =================^============== 118 -- 204.png -- =p1=> and Paul Smiled in return. 118 -- 204.png -- =p3=> and Paul smiled in return. 118 -- 204.png -- diff> =========^================ 148 -- 229.png -- =p1=> wretchedly. "That's what'S got me fussed. 148 -- 229.png -- =p3=> wretchedly. "That's what's got me fussed. 148 -- 229.png -- diff> =========================^=============== 150 -- 232.png -- =p1=> that money. It's caused too much Worry already." 150 -- 232.png -- =p3=> that money. It's caused too much worry already." 150 -- 232.png -- diff> =================================^============== dehyphenation -- should be done automatically -- easy to detect -- n=10 008 -- 023.png -- =p1=> "I suppose we couldn't buy a press secondhand 008 -- 023.png -- =p3=> "I suppose we couldn't buy a press second-hand 008 -- 023.png -- diff> =========================================^^^^ 034 -- 065.png -- =p1=> to what methods you resorted to win these con- 034 -- 065.png -- =p3=> to what methods you resorted to win these concessions 034 -- 065.png -- diff> =============================================^ 035 -- 065.png -- =p1=> cessions from these stern-purposed gentlemen. 035 -- 065.png -- =p3=> from these stern-purposed gentlemen. 035 -- 065.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 030 -- 055.png -- =p1=> "Mr. Carter said Judge Damon was an ex- 030 -- 055.png -- =p3=> "Mr. Carter said Judge Damon was an expert 030 -- 055.png -- diff> ======================================^ 031 -- 055.png -- =p1=> pert on international law," explained Paul. 031 -- 055.png -- =p3=> on international law," explained Paul. 
031 -- 055.png -- diff> ^^^^^^^^^^=^^==^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 040 -- 075.png -- =p1=> ways at liberty to send contributions back with 040 -- 075.png -- =p3=> at liberty to send contributions back with 040 -- 075.png -- diff> ^^^^^^^^^^^^^^^^^^=^^=^^^^^=^^^^^^^^^=^^^^^^^^^ 125 -- 210.png -- =p1=> he wanted to sell them. Father said so. Be 125 -- 210.png -- =p3=> he wanted to sell them. Father said so. Besides, 125 -- 210.png -- diff> ========================================== 126 -- 210.png -- =p1=> sides, what's to become of 1921 if you sell out 126 -- 210.png -- =p3=> what's to become of 1921 if you sell out 126 -- 210.png -- diff> ^^^^^^=^^^^^^^^^=^^^^^^^^^^^^^^=^^^^^^^^^^^^^^^ 127 -- 212.png -- =p1=> "What else could we sell it out for, fathead?" 127 -- 212.png -- =p3=> "What else could we sell it out for, fat-head?" 127 -- 212.png -- diff> ========================================^^^^^^ 156 -- 236.png -- =p1=> "An honest blunder is one thing; but premeditated 156 -- 236.png -- =p3=> "An honest blunder is one thing; but pre-meditated 156 -- 236.png -- diff> ========================================^^^^^^^^^ diacritic nonsense -- we don't need no high-bit characters -- easy to detect -- n=5 047 -- 081.png -- =p1=> manager; the alumnae, now scattered in 047 -- 081.png -- =p3=> manager; the alumnæ, now scattered in 047 -- 081.png -- diff> ==================^^^^^^^^^^^=^^^^^^^^ 062 -- 114.png -- =p1=> at all. They get a scenario or resume of the 062 -- 114.png -- =p3=> at all. They get a scenario or résumé
of the 062 -- 114.png -- diff> ================================^===^======= 072 -- 140.png -- =p1=> the contrary it naively confessed that it was 072 -- 140.png -- =p3=> the contrary it naïvely confessed that it was 072 -- 140.png -- diff> ==================^========================== 109 -- 192.png -- =p1=> cardboard, a sort of papier-mache, and by forcing 109 -- 192.png -- =p3=> cardboard, a sort of papier-maché, and by forcing 109 -- 192.png -- diff> ================================^================ 120 -- 205.png -- =p1=> alumnae. Judge Damon had taken to contributing 120 -- 205.png -- =p3=> alumnæ. Judge Damon had taken to contributing 120 -- 205.png -- diff> =====^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ preprocessing changes that should be standard policy -- easy to detect -- n=2 079 -- 154.png -- =p1=> "We'll talk no more about this matter today," 079 -- 154.png -- =p3=> "We'll talk no more about this matter to-day," 079 -- 154.png -- diff> ========================================^^^^^ 138 -- 218.png -- =p1=> only that he dreaded... The knob turned 138 -- 218.png -- =p3=> only that he dreaded.... The knob turned 138 -- 218.png -- diff> =======================^^^^^^^^^^^^^^^^ hyphenation and em-dash escapades -- i won't count these against proofers -- easy to detect -- n=15 036 -- 065.png -- =p1=> "The judge, for example-I can't imagine 036 -- 065.png -- =p3=> "The judge, for example -- I can't imagine 036 -- 065.png -- diff> =======================^^^^^^^^^^^^^^^^ 068 -- 134.png -- =p1=> "Could you manage it-fifty dollars?" 068 -- 134.png -- =p3=> "Could you manage it -- fifty dollars?" 068 -- 134.png -- diff> ====================^^^^^^^^^^^^^^^^ 073 -- 144.png -- =p1=> was no easy task. It was a thankless job, anywy 073 -- 144.png -- =p3=> was no easy task.
It was a thankless job, anyway -- the 073 -- 144.png -- diff> ==============================================^ 074 -- 144.png -- =p1=> -- the least interesting of any of the positions 074 -- 144.png -- =p3=> least interesting of any of the positions 074 -- 144.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^=^====^^^=^^^^^^^^^^^^^^^^ 100 -- 185.png -- =p1=> Paul had had time to become really downhearted, 100 -- 185.png -- =p3=> Paul had had time to become really down-hearted, 100 -- 185.png -- diff> =======================================^^^^^^^^ 128 -- 212.png -- =p1=> "But -- to sell it out for cash, as it stands -- 128 -- 212.png -- =p3=> "But -- to sell it out for cash, as it stands -- you 128 -- 212.png -- diff> ================================================ 129 -- 212.png -- =p1=> you mean that?" 129 -- 212.png -- =p3=> mean that?" 129 -- 212.png -- diff> ^^^^^^^^^^^^^^^ 134 -- 217.png -- =p1=> be confessing that he had failed in his mission, 134 -- 217.png -- =p3=> be confessing that he had failed in his mission, -- nay, 134 -- 217.png -- diff> ================================================ 135 -- 217.png -- =p1=> -- nay, worse than that, that he had not even 135 -- 217.png -- =p3=> worse than that, that he had not even 135 -- 217.png -- diff> ^^^^^^^^^^^^^^=^^^^^^^^^=^^^^^^^=^^^^^^^^^^^^ 136 -- 217.png -- =p1=> come, something within him had leaped into being, 136 -- 217.png -- =p3=> come, something within him had leaped into being, -- something 136 -- 217.png -- diff> ================================================= 137 -- 217.png -- =p1=> -- something that had automatically prevented 137 -- 217.png -- =p3=> that had automatically prevented 137 -- 217.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 146 -- 228.png -- =p1=> me to deposit some money in the bank for him 146 -- 228.png -- =p3=> me to deposit some money in the bank for him -- a 146 -- 228.png -- diff> ============================================ 147 -- 228.png -- =p1=> -- a hundred-dollar 
bill. I put the envelope in 147 -- 228.png -- =p3=> hundred-dollar bill. I put the envelope in 147 -- 228.png -- diff> ^^^^^^^^=^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^^^^^^^^ 160 -- 238.png -- =p1=> when weary, sleepy, but triumphant, a half 160 -- 238.png -- =p3=> when weary, sleepy, but triumphant, a half-jubilant, 160 -- 238.png -- diff> ========================================== 161 -- 238.png -- =p1=> jubilant, half-sorrowful lot of girls and boys 161 -- 238.png -- =p3=> half-sorrowful lot of girls and boys 161 -- 238.png -- diff> ^^^^^^^^^^^^^^^^=^^=^^^^^=^^^^^=^^^^^^^^^^^^^^ punctuation impossibilities -- bad constructions -- easy to detect -- n=18 001 -- 007.png -- =p1=> Copyright, 1920 001 -- 007.png -- =p3=> Copyright, 1920, 001 -- 007.png -- diff> =============== 010 -- 025.png -- =p1=> Hare it is! We"ll begin getting subscriptions 010 -- 025.png -- =p3=> Hare it is! We'll begin getting subscriptions 010 -- 025.png -- diff> ==============^============================== 039 -- 072.png -- =p1=> them would fill a room.:" 039 -- 072.png -- =p3=> them would fill a room." 039 -- 072.png -- diff> =======================^^ 055 -- 090.png -- =p1=> woodcut was to art -- simple, direct, appealing" 055 -- 090.png -- =p3=> woodcut was to art -- simple, direct, appealing." 055 -- 090.png -- diff> ===============================================^ 059 -- 096.png -- =p1=> John Gutenburg,a native of Strasburg, who 059 -- 096.png -- =p3=> John Gutenburg, a native of Strasburg, who 059 -- 096.png -- diff> ===============^^^^^^^^^^^^^^^^^^^^^^^^^^ 071 -- 138.png -- =p1=> a patronizing scorn, For a press of the Echo's 071 -- 138.png -- =p3=> a patronizing scorn. For a press of the Echo's 071 -- 138.png -- diff> ===================^========================== 075 -- 147.png -- =p1=> "How is your paper coming on, Paul?," he 075 -- 147.png -- =p3=> "How is your paper coming on, Paul?" 
he 075 -- 147.png -- diff> ===================================^^^^^ 076 -- 150.png -- =p1=> "B -- u -- t-" stammered Paul and then 076 -- 150.png -- =p3=> "B -- u -- t -- " stammered Paul and then 076 -- 150.png -- diff> ============^^^^^^^^^^^^^^^^^^^^^^^^^^ 077 -- 153.png -- =p1=> "I -- I-" faltered Paul. 077 -- 153.png -- =p3=> "I -- I -- " faltered Paul. 077 -- 153.png -- diff> =======^^^^^^^^^^^^^^^^^ 078 -- 153.png -- =p1=> "I don't quite-" 078 -- 153.png -- =p3=> "I don't quite -- " 078 -- 153.png -- diff> ==============^^ 080 -- 155.png -- =p1=> fifty-dollar bond I have" 080 -- 155.png -- =p3=> fifty-dollar bond I have." 080 -- 155.png -- diff> ========================^ 081 -- 155.png -- =p1=> it." t 081 -- 155.png -- =p3=> it." 081 -- 155.png -- diff> ====^^ 082 -- 155.png -- =p1=> Mr. Carter winked 082 -- 155.png -- =p3=> Mr. Carter winked. 082 -- 155.png -- diff> ================= 083 -- 155.png -- =p1=> "I see," he said 083 -- 155.png -- =p3=> "I see," he said. 083 -- 155.png -- diff> ================ 085 -- 158.png -- =p1=> prefer, A loan with a bond for security is 085 -- 158.png -- =p3=> prefer. A loan with a bond for security is 085 -- 158.png -- diff> ======^=================================== 086 -- 158.png -- =p1=> :But -- " 086 -- 158.png -- =p3=> "But -- " 086 -- 158.png -- diff> ^======== 091 -- 165.png -- =p1=> quantities of paper," answered his father; 091 -- 165.png -- =p3=> quantities of paper," answered his father. 
091 -- 165.png -- diff> =========================================^ 123 -- 207.png -- =p1=> his classmates to earn it, -- -for earn it he must, 123 -- 207.png -- =p3=> his classmates to earn it, -- for earn it he must, 123 -- 207.png -- diff> ==============================^^^^^^^^^^^^^^^^^^^^^ stealth scannos -- hard to detect -- n=15 003 -- 017.png -- =p1=> "Enough to till a good-sized daily, I should 003 -- 017.png -- =p3=> "Enough to fill a good-sized daily, I should 003 -- 017.png -- diff> ===========^================================ 004 -- 018.png -- =p1=> expensive piece of property, my son," he relied. 004 -- 018.png -- =p3=> expensive piece of property, my son," he replied. 004 -- 018.png -- diff> ===========================================^^^^^ 033 -- 063.png -- =p1=> the judge mischievously. "It you boys propose 033 -- 063.png -- =p3=> the judge mischievously. "If you boys propose 033 -- 063.png -- diff> ===========================^================= 087 -- 160.png -- =p1=> Paul lingered the bill nervously. Fifty dollars! 087 -- 160.png -- =p3=> Paul fingered the bill nervously. Fifty dollars! 
087 -- 160.png -- diff> =====^========================================== 090 -- 164.png -- =p1=> money and government notes are line examples 090 -- 164.png -- =p3=> money and government notes are fine examples 090 -- 164.png -- diff> ===============================^============ 094 -- 176.png -- =p1=> press rooms for striking oil proof when the 094 -- 176.png -- =p3=> press rooms for striking off proof when the 094 -- 176.png -- diff> ==========================^^=============== 103 -- 187.png -- =p1=> This east is then fitted upon the rollers 103 -- 187.png -- =p3=> This cast is then fitted upon the rollers 103 -- 187.png -- diff> =====^=================================== 106 -- 190.png -- =p1=> a small space allowed it; N, too, is not much in 106 -- 190.png -- =p3=> a small space allowed it; X, too, is not much in 106 -- 190.png -- diff> ==========================^===================== 108 -- 192.png -- =p1=> large metal sections that lit on the two halves of 108 -- 192.png -- =p3=> large metal sections that fit on the two halves of 108 -- 192.png -- diff> ==========================^======================= 122 -- 206.png -- =p1=> bid good-by to the familiar balls of the school, 122 -- 206.png -- =p3=> bid good-by to the familiar halls of the school, 122 -- 206.png -- diff> ============================^=================== 141 -- 225.png -- =p1=> hollowing them out and tilling them up again 141 -- 225.png -- =p3=> hollowing them out and filling them up again 141 -- 225.png -- diff> =======================^==================== 143 -- 226.png -- =p1=> loyally refusing to peach on his churns. That 143 -- 226.png -- =p3=> loyally refusing to peach on his chums. That 143 -- 226.png -- diff> ====================================^^^^^^^^^ 145 -- 227.png -- =p1=> "They say there always has to be a fist time. 145 -- 227.png -- =p3=> "They say there always has to be a first time. 
145 -- 227.png -- diff> =====================================^^^^^^^^ 152 -- 234.png -- =p1=> "And I oh yours, Mr. Carter. Melville is a 152 -- 234.png -- =p3=> "And I on yours, Mr. Carter. Melville is a 152 -- 234.png -- diff> ========^================================= 159 -- 237.png -- =p1=> course, the far-tamed March Hare. Its advent 159 -- 237.png -- =p3=> course, the far-famed March Hare. Its advent 159 -- 237.png -- diff> ================^=========================== missing/excess words -- hard to detect -- n=13 014 -- 036.png -- =p1=> of being shrewd, close-fisted, and 014 -- 036.png -- =p3=> reputation of being shrewd, close-fisted, and 014 -- 036.png -- diff> ^^^^^^^^^^^^^^^^^^^^=^^^^^^^^^^^^^ 021 -- 046.png -- =p1=> kings, bishops, and persons of rank could 021 -- 046.png -- =p3=> many kings, bishops, and persons of rank could 021 -- 046.png -- diff> ^^=^^^^^^=^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^ 026 -- 051.png -- =p1=> and were sold to of the Church or to 026 -- 051.png -- =p3=> and were sold to dignitaries of the Church or to 026 -- 051.png -- diff> =================^^^^^^^^^^^^^^^^^^^ 028 -- 053.png -- =p1=> ??line missing here...?? 028 -- 053.png -- =p3=> "Yes." 028 -- 053.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^ 038 -- 070.png -- =p1=> I have already explained, care much for reading; 038 -- 070.png -- =p3=> have already explained, care much for reading; 038 -- 070.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^^^^^ 051 -- 084.png -- =p1=> a under the ropes." 051 -- 084.png -- =p3=> under the ropes." 
051 -- 084.png -- diff> ^^^^^^^^^^^^^^^^^^^ 056 -- 090.png -- =p1=> is public that desired to read -- which this one did 056 -- 090.png -- =p3=> public that desired to read -- which this one did 056 -- 090.png -- diff> ^^^^^^^^^^=^^^^^^^^^^^=^^^^^^^=^^^^=^^=^^^^^^^^^^^^^ 057 -- 092.png -- =p1=> More than one dignified resident of town struggled 057 -- 092.png -- =p3=> More than one dignified resident of the town struggled 057 -- 092.png -- diff> =====================================^^^^^^^^^^^^^ 064 -- 119.png -- =p1=> author the prey of vultures who 064 -- 119.png -- =p3=> author was the prey of vultures who 064 -- 119.png -- diff> =======^^^=^^=^^^^^^^^^^^^^^^^^ 101 -- 186.png -- =p1=> their days." 101 -- 186.png -- =p3=> their days. "I'm going to take you upstairs 101 -- 186.png -- diff> ===========^ 102 -- 186.png -- =p1=> Mr. Hawley said briskly. "We may 102 -- 186.png -- =p3=> first," Mr. Hawley said briskly. "We may 102 -- 186.png -- diff> ^^^^^^^^^^^^^^^^^^^=^^^^^^^^^^^^ 116 -- 202.png -- =p1=> publishers." I 116 -- 202.png -- =p3=> publishers." 116 -- 202.png -- diff> ============^^ 124 -- 209.png -- =p1=> "Because -- well-it Would be so yellow," 124 -- 209.png -- =p3=> "Because -- well -- it would be so darn yellow," 124 -- 209.png -- diff> ================^^^=^^^^^^^^=^^=^^^^^^^^ punctuation errors that are not impossibilities -- hard to detect -- n=9 016 -- 038.png -- =p1=> pay too." 016 -- 038.png -- =p3=> pay, too." 016 -- 038.png -- diff> ===^^^=^^ 023 -- 049.png -- =p1=> "This book was illuminated, bound, and 023 -- 049.png -- =p3=> "'This book was illuminated, bound, and 023 -- 049.png -- diff> =^^^^^^^=^^^^^^^^=^^^^^^^^^^^^^^^^^^^^ 032 -- 059.png -- =p1=> Cameron." Call them up this minute and nail 032 -- 059.png -- =p3=> Cameron. 
"Call them up this minute and nail 032 -- 059.png -- diff> ========^^================================= 060 -- 096.png -- =p1=> was the principle of it is identical with that 060 -- 096.png -- =p3=> was, the principle of it is identical with that 060 -- 096.png -- diff> ===^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 114 -- 201.png -- =p1=> periodicals, "Mr. Hawley managed to shout 114 -- 201.png -- =p3=> periodicals," Mr. Hawley managed to shout 114 -- 201.png -- diff> ============^^=========================== 121 -- 205.png -- =p1=> and two of Burminghams graduates 121 -- 205.png -- =p3=> and two of Burmingham's graduates 121 -- 205.png -- diff> =====================^^^^^^^^^^^ 144 -- 227.png -- =p1=> the five hundredth-time Don had been caught 144 -- 227.png -- =p3=> the five hundredth -- time Don had been caught 144 -- 227.png -- diff> ==================^^^^^^^^^^^^^^^^^^^^^^^^^ 153 -- 235.png -- =p1=> In fact," he continued, lapsing into seriousness," 153 -- 235.png -- =p3=> In fact," he continued, lapsing into seriousness, 153 -- 235.png -- diff> =================================================^ 154 -- 235.png -- =p1=> the younger generation teaches us 154 -- 235.png -- =p3=> "the younger generation teaches us 154 -- 235.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ------------------------------------------------------------------ ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15& ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080328/21d584e5/attachment-0001.htm From paulmaas at airpost.net Fri Mar 28 19:04:48 2008 From: paulmaas at airpost.net (Paul Maas) Date: Fri, 28 Mar 2008 19:04:48 -0700 Subject: [gutvol-d] Clay Shirkey: "Given enough eyeballs, all typos are shallow" Message-ID: <1206756288.4463.1244898735@webmail.messagingengine.com> A break from the bowerbird mass flood: http://www.shirky.com/herecomeseverybody/2008/03/given-enough-eyeballs-all-typo.html -- Paul Maas paulmaas at airpost.net -- http://www.fastmail.fm - One of many happy users: http://www.fastmail.fm/docs/quotes.html From Bowerbird at aol.com Fri Mar 28 19:40:34 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 28 Mar 2008 22:40:34 EDT Subject: [gutvol-d] Clay Shirkey: "Given enough eyeballs, all typos are shallow" Message-ID: paul said: > A break from the bowerbird mass flood: > http://www.shirky.com/herecomeseverybody/2008/03/given-enough-eyeballs-all-typo.html paul, you really need to engage your brain before posting. shirky is saying _exactly_ the same thing as i'm saying, except my "mass flood" is because i'm providing _data_ that _proves_ what i'm saying, rather than just spouting a title for a blog-entry based on a personal experience. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080328/73283582/attachment.htm From paulmaas at airpost.net Fri Mar 28 20:32:51 2008 From: paulmaas at airpost.net (Paul Maas) Date: Fri, 28 Mar 2008 20:32:51 -0700 Subject: [gutvol-d] Clay Shirkey: "Given enough eyeballs, all typos are shallow" In-Reply-To: References: Message-ID: <1206761571.18076.1244905483@webmail.messagingengine.com> Shaking my head on this.
What the hell is wrong with you? On Fri, 28 Mar 2008 22:40:34 EDT, Bowerbird at aol.com said: > paul said: > > A break from the bowerbird mass flood: > > > http://www.shirky.com/herecomeseverybody/2008/03/given-enough-eyeballs-all-typo.html > > paul, you really need to engage your brain before posting. > > shirky is saying _exactly_ the same thing as i'm saying, > except my "mass flood" is because i'm providing _data_ > that _proves_ what i'm saying, rather than just spouting > a title for a blog-entry based on a personal experience. > > -bowerbird -- Paul Maas paulmaas at airpost.net -- http://www.fastmail.fm - Email service worth paying for. Try it for free From paulmaas at airpost.net Fri Mar 28 20:37:35 2008 From: paulmaas at airpost.net (Paul Maas) Date: Fri, 28 Mar 2008 20:37:35 -0700 Subject: [gutvol-d] Clay Shirkey: "Given enough eyeballs, all typos are shallow" In-Reply-To: References: Message-ID: <1206761855.18679.1244905587@webmail.messagingengine.com> Also, you certainly are providing data, but why not complete your research, then post your summary? All I see is a huge deluge of raw data that's best described as spam. Are you trying to convince us by writing these unbelievably long messages? On Fri, 28 Mar 2008 22:40:34 EDT, Bowerbird at aol.com said: > paul said: > > A break from the bowerbird mass flood: > > > http://www.shirky.com/herecomeseverybody/2008/03/given-enough-eyeballs-all-typo.html > > paul, you really need to engage your brain before posting. > > shirky is saying _exactly_ the same thing as i'm saying, > except my "mass flood" is because i'm providing _data_ > that _proves_ what i'm saying, rather than just spouting > a title for a blog-entry based on a personal experience.
> > -bowerbird -- Paul Maas paulmaas at airpost.net -- http://www.fastmail.fm - Same, same, but different From marcello at perathoner.de Sat Mar 29 00:16:37 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 29 Mar 2008 08:16:37 +0100 Subject: [gutvol-d] Clay Shirkey: "Given enough eyeballs, all typos are shallow" In-Reply-To: <1206761855.18679.1244905587@webmail.messagingengine.com> References: <1206761855.18679.1244905587@webmail.messagingengine.com> Message-ID: <47EDECD5.6090606@perathoner.de> Paul Maas wrote: > Also, you certainly are providing data, but why not complete > your research, then post your summary? All I see is a huge > deluge of raw data that's best described as spam. Are you > trying to convince us because you write these unbelievably > long messages? BB is a socially inept troglodyte with way too much time on his hands. *He* thinks he is a genius because he knows how to convert his social security check into money. Everybody else thinks he is a crank. His tendency to write longer and longer nonsense in the hope of enticing somebody into a fight is a direct consequence of everybody else having him killfiled. Just do the same. What's wrong with BB? See: http://www.gnutenberg.de/bowerbird/ -- Marcello Perathoner webmaster at gutenberg.org From hart at pglaf.org Sat Mar 29 06:56:56 2008 From: hart at pglaf.org (Michael Hart) Date: Sat, 29 Mar 2008 06:56:56 -0700 (PDT) Subject: [gutvol-d] PG's #25,000 Message-ID: In a couple weeks we will be coming up on eBook #25,000 in our numbering cycle. If anyone has any suggestions for #25,000. . . .
Michael From paulmaas at airpost.net Sat Mar 29 08:26:31 2008 From: paulmaas at airpost.net (Paul Maas) Date: Sat, 29 Mar 2008 08:26:31 -0700 Subject: [gutvol-d] Clay Shirkey: "Given enough eyeballs, all typos are shallow" In-Reply-To: <47EDECD5.6090606@perathoner.de> References: <1206761855.18679.1244905587@webmail.messagingengine.com> <47EDECD5.6090606@perathoner.de> Message-ID: <1206804391.28523.1244951345@webmail.messagingengine.com> Wow, bowerbirdy is definitely a social misfit. Your advice to ignore him is excellent. It'd be cool if this group's listserver allowed one to kill-file at the source. This way the mail is never sent. With this system each subscriber can query the maillist application to see how widely he/she is kill-filed. In the case of bowerbirdy, with such a capability, no doubt 95% of all subscribers would flip the switch on him once told how to do it. He'd probably quit posting here since he'd know no one is listening to him. I also believe this group's archive is not indexed by Google and other search engines, so what he posts here is not even findable. I'm amazed he continues posting here instead of in a Google-indexed blog. This group is almost like a black hole of information exchange. I wonder why the archive is not open so it can be indexed? Maybe the owners of this group are embarrassed by posters like bowerbirdy. "Can't kick them off since that goes against our principles, so let's close the archive to the public so no one can see the garbage being posted here." Makes sense to me. On Sat, 29 Mar 2008 08:16:37 +0100, "Marcello Perathoner" > BB is a social inept troglodyte with way too much time on his hands. > *He* thinks he is a genius because he knows how to convert his social > security check into money. Everybody else thinks he is a crank. > > His tendency of writing longer and longer nonsense in the hope of > enticing somebody into a fight is a direct consequence of everybody else > having him killfiled. > > Just do the same.
> > > Whats wrong with BB? See: > > http://www.gnutenberg.de/bowerbird/ > > > -- > Marcello Perathoner > webmaster at gutenberg.org > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -- Paul Maas paulmaas at airpost.net -- http://www.fastmail.fm - A fast, anti-spam email service. From Bowerbird at aol.com Sat Mar 29 11:30:58 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 29 Mar 2008 14:30:58 EDT Subject: [gutvol-d] the living room of the project gutenberg library Message-ID: one big reason why i post on this listserve is because it's the living room of the project gutenberg library, and i consider that to be a neat place to hang out... michael hart is one of my big heroes... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080329/b78825a7/attachment.htm From ajhaines at shaw.ca Sat Mar 29 15:44:03 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sat, 29 Mar 2008 15:44:03 -0700 Subject: [gutvol-d] the living room of the project gutenberg library References: Message-ID: <001401c891ee$634e29d0$6501a8c0@ahainesp2400> The library itself would be even neater, and you'd be paying a compliment to Michael, if you produced some (more) books. In honour of the upcoming 25000th assigned etext number, how about producing 25 books over the remainder of 2008?
Consider this a challenge--divert some of that time and energy you use in critiquing DP, and produce 25 ebooks, their titles not currently in PG, by yourself, outside of DP, starting from real books of, say, 250 pages or more each (not scansets from Internet Archive, Google Books, or similar), using your tools and techniques, and have them posted in PG, by the end of 2008. ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Saturday, March 29, 2008 11:30 AM Subject: [gutvol-d] the living room of the project gutenberg library one big reason why i post on this listserve is because it's the living room of the project gutenberg library, and i consider that to be a neat place to hang out... michael hart is one of my big heroes... -bowerbird ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080329/7c5e32cc/attachment.htm From nwolcott2ster at gmail.com Sun Mar 30 09:19:10 2008 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Sun, 30 Mar 2008 11:19:10 -0500 Subject: [gutvol-d] Googles denial of service messages Message-ID: <002001c89281$d8776940$660fa8c0@atlanticbb.net> I was poking around on books.google.com yesterday. I was doing some searches, looking at some of the results, saving the about page, and downloading the pdf for the ones I was interested in. Then I got the "alphabet-soup" message from Google saying I was acting like a robot. I was given the option to continue if I could read the 5 alpha characters in the box.
I continued once or twice, then got the alphabet soup message again, this time asking me to identify the letters in two successive boxes. Then a couple of actions later (just using the back key) I got the denial of service message reminding me that bots were violating their terms of service. It is not clear if I am on their permanent s--- list, or on probation for a month or so. In any event it appears that they are monitoring my individual IP address, my router's IP address, and my cable modem's IP address. I tried using another computer on the same router; when I got the alphabet soup message almost immediately, I did not proceed to the denial of service message. It is obvious that since my transgressions were purely random, I was not being tracked for being a bot but for using the site too much. I don't know if being logged into their site hurt or not. I like to add books to "my Library", but maybe being on their list sets you up as a problem. Is there any way I can control or change one or all of these IP addresses (and incidentally monitor what they are) so that I can even the playing field with Google? I saw a page somewhere where you could "assign a permanent IP address". One of my problems is that with the usual setup the IP addresses are automatically assigned. Thus they are always the first available on the list, which sets me up for Google's tracker. Even changing my local IP address gives me only 256 choices, I think, and that will not be much help if they are monitoring the other two. I hope some of you experts out there can offer some suggestions for fighting back with Uncle Google. Of course Google will be monitoring my email here as well, but maybe I should refer to them as "foogle" or maybe you have a better suggestion. nwolcott2 at post.harvard.edu -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080330/dfc1423f/attachment.htm From Bowerbird at aol.com Sun Mar 30 11:30:06 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 30 Mar 2008 14:30:06 EDT Subject: [gutvol-d] Googles denial of service messages Message-ID: norm said: > I hope some of you experts out there > can offer some suggestions to fighting back with Uncle Google. > Of course google will be monitoring my email here as well, > but maybe I should refer to them as "foogle" > or maybe you have a better suggestion. um, i'm not an expert at this stuff, not by any means. all that i.p. gobbledygook confuses me immensely... but i do have a suggestion. starting with a question: why do you consider a need to "fight back"? i'd think this is a simple misunderstanding, not a "fight". instead of asking us what to do, write directly to google. (yeah, i know that's easier said than done, but just try it.) although you characterize your use as innocent, and i do believe you, i'm sure it's also "heavy" use, and looks like it, so it's not all that surprising it might look bot-like to them. but if you explain the situation, maybe they would make an adjustment concerning your i.p. address that allowed you to exercise your typical heavy usage without tripping their wires. in the long run that'd be far better than gaming i.p. addresses. it would also give useful information to us other heavy users... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080330/d6671fd2/attachment.htm From marcello at perathoner.de Sun Mar 30 12:17:18 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Sun, 30 Mar 2008 21:17:18 +0200 Subject: [gutvol-d] Googles denial of service messages In-Reply-To: <002001c89281$d8776940$660fa8c0@atlanticbb.net> References: <002001c89281$d8776940$660fa8c0@atlanticbb.net> Message-ID: <47EFE73E.3000004@perathoner.de> Norm Wolcott wrote: > I hope some of you experts out there can offer some suggestions to > fighting back with Uncle Google. Of course google will be monitoring > my email here as well, but maybe I should refer to them as "foogle" > or maybe you have a better suggestion. nwolcott2 at post.harvard.edu First, this is no denial of service. Google is a commercial enterprise, they are offering this service at a considerable cost, so they get to make the rules. If you don't like Google, don't use it. Second, the only IP Google can see is your router's / modem's IP. The configuration of your internal LAN is completely irrelevant. Your best bet is to power cycle your router / modem to get a new IP from your provider. -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Mon Mar 31 10:44:57 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 31 Mar 2008 13:44:57 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 14 Message-ID: ok, i checked up on juliet's contention that some projects over at d.p. are auto-dehyphenated during preprocessing, and have found that to be the case. in a random check of a good number, i found that many went into p1 dehyphenated. indeed, although i still found some that had not been, a clear _majority_ had been done. i salute this as one very real step of progress by d.p. this makes big_bill's bellowing even more inexplicable... (and he's continued, even exacerbated, that bellowing.) does he proof? doesn't he know about this development? 
at any rate, the fact also remains that _none_ of the test books which d.p. has been running in these experiments was auto-dehyphenated. neither were any subjected to other preprocessing that should have become "standard" over at d.p. a long time ago, like automatically closing up spacey punctuation, and spacey contractions (like "we 're"). considering that there can be _hundreds_ of such entities, even thousands in a typical book, this lack is unforgivable. evidently the content providers of these books don't know that they should be doing preprocessing on their projects, rather than dumping thousands of _unnecessary_changes_ on the p1 proofers... nonetheless, i do give credit when a positive step is taken, and autodehyphenation is a positive step, so credit given... (and yes, dehyphenation at such an early stage still remains _the_wrong_policy_. but if you're gonna have proofers do it then, you might as well have the computer do it then instead.)

***

back to our analysis of the data in the parallel proofing test. so, were you surprised to learn that the normal p1 proofers took 6,400+ lines to perfection, with only _161_ not perfect? yep, p2 and p3 on this project only changed 161 lines:

> http://z-m-l.com/go/paulp/paul-p1-p3-161changes.html

all of the rest of the lines were evidently perfect after p1. (for a comparison, i estimate p1 fixed about 1,750 lines.)

***

and even though a mere 161 imperfect lines out of 6,500+ is an amazingly high rate of quality, closer analysis of those 161 bad lines suggests most could've been _autodetected_, meaning they should've been fixed during _preprocessing_, before they were ever even presented to volunteers to proof. to see how i categorized the 161 changed lines, look here:

> http://z-m-l.com/go/paulp/paul-161-categorize.html

well, by my count, all but _37_ could've been autodetected. (my earlier guess of _40_ ended up being pretty accurate.)
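the cleanups described above -- dehyphenation, closing up spacey punctuation, and spacey contractions like "we 're" -- can be sketched in a few regex passes. this is only an illustrative sketch (the function name and rules below are an editor's illustration, not d.p.'s actual preprocessing code), and note that naive dehyphenation will wrongly join words the book itself prints with a hyphen (like "to-day"), which is why a real pass would check rejoined words against a wordlist:

```python
import re

def preprocess(text):
    # a sketch of the cleanups discussed above -- NOT d.p.'s actual
    # preprocessing code; the rules here are illustrative only.
    # 1. rejoin words hyphenated across a line break ("thank-\nless");
    #    a real pass would check the joined word against a wordlist,
    #    since books of this era legitimately print words like "to-day".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # 2. close up spacey punctuation ("task ." -> "task.")
    text = re.sub(r" +([,.;:!?])", r"\1", text)
    # 3. close up spacey contractions ("we 're" -> "we're")
    text = re.sub(r"(\w) +'(ll|re|ve|d|m|s|t)\b", r"\1'\2", text)
    return text
```

run over the raw o.c.r. before a project goes to p1, a pass like this would remove hundreds of the unnecessary changes mentioned above without touching real content.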
37 errors is still too many for a book with some 200+ pages, even to send the text to the public for "continuous proofing", but it's one heck of a great performance for _p1_ to turn in...

***

and i wasn't done yet... i then categorized the 37 remaining lines, and put it here:

> http://z-m-l.com/go/paulp/paul-37-not-autodetectable.html

i have also appended it to this post, for your convenience... these difficult-to-detect 37 lines break down like this:

-> stealth scannos -- 15
-> missing words -- 13
-> punctuation problems -- 9

stealth scannos, of course, are the prototype of hard-to-detect. and there seemed to be a _lot_ of stealth scannos in this book... but wait. maybe that's because _tesseract_ was used for o.c.r.? it's worth a look... so i did an easy test for that... i just looked to see how many of the 15 stealth scannos which had persisted through p1 were present in the o.c.r. by _abbyy_. wow. 1 of them was caused by a bad scan, but the other _14_
yes, the proofers let these errors through, but if they wouldn't have been present originally -- i.e., if abbyy had been used -- then they wouldn't have been present at the end of p1 either... so let's go to our last category, which is "punctuation errors"... here the story is a little less clear, and a little more convoluted. only 1 error was due to tesseract, and 2 due to the bad scans. 2 were due to a dehyphenation error, which don't count against the proofers. that leaves just _4_ errors that were p1 mistakes... 049.png -- =p1=> "This book was illuminated, bound, and 049.png -- =p3=> "'This book was illuminated, bound, and ***p1 mistake 049.png -- diff> =^^^^^^^=^^^^^^^^=^^^^^^^^^^^^^^^^^^^^ 059.png -- =p1=> Cameron." Call them up this minute and nail 059.png -- =p3=> Cameron. "Call them up this minute and nail ***p1 mistake 059.png -- diff> ========^^===================== 205.png -- =p1=> and two of Burminghams graduates 205.png -- =p3=> and two of Burmingham's graduates ***p1 mistake 205.png -- diff> =====================^^^^^^^^^^^ 227.png -- =p1=> the five hundredth-time Don had been caught 227.png -- =p3=> the five hundredth -- time Don had been caught ***p1 mistake 227.png -- diff> ==================^^^^^^^^^^^^^^^^^^^^^^^^^ (and i'm not gonna look too closely at those 4, because a second glance now indicates that a couple of them, and maybe all 4, aren't p1 mistakes after all.) the list of 37 -- appended, and at the u.r.l. above -- shows how i categorized each of the 37 lines in terms of their being a result of bad scans, tesseract, etc. for example, the first one, which involves a missing comma after the word "pay", was due to a bad scan, which cut off that word (and the comma that followed)... *** ok, let me summarize, because this conclusion is remarkable and startling. on this project, _if_ we would have had good scans to begin with, which is certainly _not_ an unreasonable thing to ask, and _if_ we would have had the o.c.r. 
done with abbyy, which is again _not_ unreasonable to expect, and _if_ we had done a good job of preprocessing this text (and/or done a good job of cleaning _after_ it had come out of p1), which is _also_not_ an unreasonable expectation that we should have, _then_ the p1 proofers would have taken all but _4_lines_ of the _6,500+_lines_ to _perfection_... p1 -- and just one round of p1 at that -- took this book to near-perfection. you can't see it well, because this perfection sits smack-dab in the middle of literally _hundreds_ of errors "injected" by an incompetent content provider, and literally _hundreds_ of meaningless and unnecessary changes, but if you clear away the senseless underbrush, there's a sparkling diamond underneath. in one short sentence, the p1 proofers are _awesome_. -bowerbird p.s. the 37 bad lines out of p1 (out of 161) which were _not_ autodetectable... stealth 003 -- 017.png -- =p1=> "Enough to till a good-sized daily, I should 003 -- 017.png -- =p3=> "Enough to fill a good-sized daily, I should ***abbyy 003 -- 017.png -- diff> ===========^================================ 004 -- 018.png -- =p1=> expensive piece of property, my son," he relied. 004 -- 018.png -- =p3=> expensive piece of property, my son," he replied. ***scan 004 -- 018.png -- diff> ===========================================^^^^^ 033 -- 063.png -- =p1=> the judge mischievously. "It you boys propose 033 -- 063.png -- =p3=> the judge mischievously. "If you boys propose ***abbyy 033 -- 063.png -- diff> ===========================^================= 087 -- 160.png -- =p1=> Paul lingered the bill nervously. Fifty dollars! 087 -- 160.png -- =p3=> Paul fingered the bill nervously. Fifty dollars!
***abbyy 087 -- 160.png -- diff> =====^========================================== 090 -- 164.png -- =p1=> money and government notes are line examples 090 -- 164.png -- =p3=> money and government notes are fine examples ***abbyy 090 -- 164.png -- diff> ===============================^============ 094 -- 176.png -- =p1=> press rooms for striking oil proof when the 094 -- 176.png -- =p3=> press rooms for striking off proof when the ***abbyy 094 -- 176.png -- diff> ==========================^^=============== 103 -- 187.png -- =p1=> This east is then fitted upon the rollers 103 -- 187.png -- =p3=> This cast is then fitted upon the rollers ***abbyy 103 -- 187.png -- diff> =====^=================================== 106 -- 190.png -- =p1=> a small space allowed it; N, too, is not much in 106 -- 190.png -- =p3=> a small space allowed it; X, too, is not much in ***abbyy 106 -- 190.png -- diff> ==========================^===================== 108 -- 192.png -- =p1=> large metal sections that lit on the two halves of 108 -- 192.png -- =p3=> large metal sections that fit on the two halves of ***abbyy 108 -- 192.png -- diff> ==========================^======================= 122 -- 206.png -- =p1=> bid good-by to the familiar balls of the school, 122 -- 206.png -- =p3=> bid good-by to the familiar halls of the school, ***abbyy 122 -- 206.png -- diff> ============================^=================== 141 -- 225.png -- =p1=> hollowing them out and tilling them up again 141 -- 225.png -- =p3=> hollowing them out and filling them up again ***abbyy 141 -- 225.png -- diff> =======================^==================== 143 -- 226.png -- =p1=> loyally refusing to peach on his churns. That 143 -- 226.png -- =p3=> loyally refusing to peach on his chums. That ***abbyy 143 -- 226.png -- diff> ====================================^^^^^^^^^ 145 -- 227.png -- =p1=> "They say there always has to be a fist time. 145 -- 227.png -- =p3=> "They say there always has to be a first time. 
***abbyy 145 -- 227.png -- diff> =====================================^^^^^^^^ 152 -- 234.png -- =p1=> "And I oh yours, Mr. Carter. Melville is a 152 -- 234.png -- =p3=> "And I on yours, Mr. Carter. Melville is a ***abbyy 152 -- 234.png -- diff> ========^================================= 159 -- 237.png -- =p1=> course, the far-tamed March Hare. Its advent 159 -- 237.png -- =p3=> course, the far-famed March Hare. Its advent ***abbyy 159 -- 237.png -- diff> ================^=========================== missing/excess words -- hard to detect 014 -- 036.png -- =p1=> of being shrewd, close-fisted, and 014 -- 036.png -- =p3=> reputation of being shrewd, close-fisted, and ***scan 014 -- 036.png -- diff> ^^^^^^^^^^^^^^^^^^^^=^^^^^^^^^^^^^ 021 -- 046.png -- =p1=> kings, bishops, and persons of rank could 021 -- 046.png -- =p3=> many kings, bishops, and persons of rank could ***scan 021 -- 046.png -- diff> ^^=^^^^^^=^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^ 026 -- 051.png -- =p1=> and were sold to of the Church or to 026 -- 051.png -- =p3=> and were sold to dignitaries of the Church or to ***tess 026 -- 051.png -- diff> =================^^^^^^^^^^^^^^^^^^^ 028 -- 053.png -- =p1=> ??line missing here...?? 028 -- 053.png -- =p3=> "Yes." ***tess 028 -- 053.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^ 038 -- 070.png -- =p1=> I have already explained, care much for reading; 038 -- 070.png -- =p3=> have already explained, care much for reading; ***scan 038 -- 070.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^^^^^ 051 -- 084.png -- =p1=> a under the ropes." 051 -- 084.png -- =p3=> under the ropes." 
***scan 051 -- 084.png -- diff> ^^^^^^^^^^^^^^^^^^^ 056 -- 090.png -- =p1=> is public that desired to read -- which this one did 056 -- 090.png -- =p3=> public that desired to read -- which this one did ***scan 056 -- 090.png -- diff> ^^^^^^^^^^=^^^^^^^^^^^=^^^^^^^=^^^^=^^=^^^^^^^^^^^^^ 057 -- 092.png -- =p1=> More than one dignified resident of town struggled 057 -- 092.png -- =p3=> More than one dignified resident of the town struggled ***tess 057 -- 092.png -- diff> =====================================^^^^^^^^^^^^^ 064 -- 119.png -- =p1=> author the prey of vultures who 064 -- 119.png -- =p3=> author was the prey of vultures who ***tess 064 -- 119.png -- diff> =======^^^=^^=^^^^^^^^^^^^^^^^^ 101 -- 186.png -- =p1=> their days." 101 -- 186.png -- =p3=> their days. "I'm going to take you upstairs ***tess 101 -- 186.png -- diff> ===========^ 102 -- 186.png -- =p1=> Mr. Hawley said briskly. "We may 102 -- 186.png -- =p3=> first," Mr. Hawley said briskly. "We may ***tess 102 -- 186.png -- diff> ^^^^^^^^^^^^^^^^^^^=^^^^^^^^^^^^ 116 -- 202.png -- =p1=> publishers." I 116 -- 202.png -- =p3=> publishers." ***tess 116 -- 202.png -- diff> ============^^ 124 -- 209.png -- =p1=> "Because -- well-it Would be so yellow," 124 -- 209.png -- =p3=> "Because -- well -- it would be so darn yellow," ***tess 124 -- 209.png -- diff> ================^^^=^^^^^^^^=^^=^^^^^^^^ punctuation -- hard to detect 016 -- 038.png -- =p1=> pay too." 016 -- 038.png -- =p3=> pay, too." ***scan 016 -- 038.png -- diff> ===^^^=^^ 023 -- 049.png -- =p1=> "This book was illuminated, bound, and 023 -- 049.png -- =p3=> "'This book was illuminated, bound, and ***p1 mistake 023 -- 049.png -- diff> =^^^^^^^=^^^^^^^^=^^^^^^^^^^^^^^^^^^^^ 032 -- 059.png -- =p1=> Cameron." Call them up this minute and nail 032 -- 059.png -- =p3=> Cameron. 
"Call them up this minute and nail ***p1 mistake 032 -- 059.png -- diff> ========^^================================= 060 -- 096.png -- =p1=> was the principle of it is identical with that 060 -- 096.png -- =p3=> was, the principle of it is identical with that ***scan 060 -- 096.png -- diff> ===^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 114 -- 201.png -- =p1=> periodicals, "Mr. Hawley managed to shout 114 -- 201.png -- =p3=> periodicals," Mr. Hawley managed to shout ***tess 114 -- 201.png -- diff> ============^^=========================== 121 -- 205.png -- =p1=> and two of Burminghams graduates 121 -- 205.png -- =p3=> and two of Burmingham's graduates ***p1 mistake 121 -- 205.png -- diff> =====================^^^^^^^^^^^ 144 -- 227.png -- =p1=> the five hundredth-time Don had been caught 144 -- 227.png -- =p3=> the five hundredth -- time Don had been caught ***p1 mistake 144 -- 227.png -- diff> ==================^^^^^^^^^^^^^^^^^^^^^^^^^ 153 -- 235.png -- =p1=> In fact," he continued, lapsing into seriousness," 153 -- 235.png -- =p3=> In fact," he continued, lapsing into seriousness, ***dehyphenation 153 -- 235.png -- diff> =================================================^ 154 -- 235.png -- =p1=> the younger generation teaches us 154 -- 235.png -- =p3=> "the younger generation teaches us ***dehyphenation 154 -- 235.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15& ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080331/4ba8f62f/attachment-0001.htm From ajhaines at shaw.ca Mon Mar 31 11:27:30 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Mon, 31 Mar 2008 11:27:30 -0700 Subject: [gutvol-d] parallel -- paul and the printing press -- 14 References: Message-ID: <000f01c8935c$e17c8e90$6b01a8c0@ahainesp2400> See PG FAQ V.105 for its discussion of spacey contractions. It's allowed that they get closed up, but it's up to the volunteer, so it's probably better to not close them up automatically. I've done books where the OCR spaced some contractions and not others, and it wasn't easy to tell from the book whether the contractions were meant to be spaced or not. If spacing was obvious, I went with it; if not, I didn't. In either case, I went with consistency--if most were not spaced, I despaced any others, and vice versa. I've also encountered in a book (but only once): "wasn 't". Since this is obviously wrong, whether the fault of the author or the typesetter, it got despaced. Contractions can be spaced away from their companion word, but they *cannot* themselves be split. On a separate note, I notice my challenge to bowerbird (see the "the living room of the project gutenberg library" thread) has gone unanswered. Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Monday, March 31, 2008 10:44 AM Subject: [gutvol-d] parallel -- paul and the printing press -- 14 ok, i checked up on juliet's contention that some projects over at d.p. are auto-dehyphenated during preprocessing, and have found that to be the case. in a random check of a good number, i found that many went into p1 dehyphenated. indeed, although i still found some that had not been, a clear _majority_ had been done. i salute this as one very real step of progress by d.p. this makes big_bill's bellowing even more inexplicable... (and he's continued, even exacerbated, that bellowing.)
does he proof? doesn't he know about this development? [remainder of quoted message trimmed; it duplicates the post above verbatim] ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080331/85edd045/attachment-0001.htm From Bowerbird at aol.com Mon Mar 31 12:43:21 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 31 Mar 2008 15:43:21 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 14 Message-ID: al said: > See PG FAQ V.105 for its discussion of spacey contractions. um, yeah... um, no... > It's allowed that they get closed up, but it's up to the volunteer, > so it's probably better to not close them up automatically. see, this is why the p.g. rules don't mean much to me. there are far too many things left "up to the volunteer". that means p.g. has become a mere collection of works -- many of which have inconsistencies with each other -- rather than achieving a coherence making it _a_library_... > I've done books where the OCR spaced some contractions > and not others, and it wasn't easy to tell from the book > whether the contractions were meant to be spaced or not. well, here's my take on all that, al... last _year_ d.p. digitized 2,345 books. google scans that many books every _day_... ..._before_lunch_... if d.p. -- and p.g. -- want to have _any_ hope of keeping up, (or anything close), it's going to become _necessary_ to stop wasting time sweating differences that don't make a difference. this is one of those differences... spacey contractions "look funny" to today's reader. maybe at some time in the past they had _meaning_ -- probably to indicate a certain pattern of speech -- but whatever it was, it's now largely lost on people, so we need to stop spending time fretting over it... so i have my tool automatically close up contractions, so digitizers can move on to more important things... because i think it's _important_ that we keep up with google.
because if we don't, people will soon forget about the various _advantages_ which digital text offers over a plain old scan-set, because there will be so few books (out of millions of scan-sets) for which they actually have the _luxury_ of having digital text. the e-book of the future will become a scan-set, _by_default_... > If spacing was obvious, I went with it; if not, I didn't. > In either case, I went with consistency--if most were > not spaced, I despaced any others, and vice versa. well, if you're following a rule fairly consistently like that, then _that_ could be programmed... but -- to be frank -- none of it really "matters". spacey contractions look funny. so even though you spent all that decision-making time, i'm gonna take your e-text and close up the contractions. just like i turn all the 4-dot ellipses from d.p. into 3-dots. yeah, yeah, i know someone spent lots of time _deciding_ whether they occurred at the end of a sentence, or not... blah. so what? who cares? it's still a freaking _ellipsis_, and it still means the same thing, 3-dots or 4-dots, so i'm _sorry_ you wasted your time. _i_ can't be bothered. instead, i'm going to spend my time _productively_, on the things that _do_ make a difference to my readers, and that's why they will use _my_ library and not _yours_. > On a separate note, I notice my challenge to bowerbird > (see the "the living room of the project gutenberg library" > thread) has gone unanswered. not in the slightest. my reply has already grown lengthy, and i'm not even done yet, but it'll be coming along soon. i generally like to hold off long posts during a weekend. the "parallel #14" post i made today was ready on friday. (plus i have a preference for on-topic versus personal... which leads me to ask if you have any _real_ response to the _data_ on _digitization_ that i presented in that post. i mean, the analysis ended with a _startling_ conclusion, so i'd expect that people would have _something_ to say.)
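(the two normalizations mentioned above -- closing up contractions, and collapsing 4-dot ellipses to 3 -- amount to a couple of regexes. here's a rough sketch; the rules are my own guesses at what such a tool might do, not the actual implementation:)

```python
import re

def normalize(text):
    # close up spacey contractions: "we 're" -> "we're"
    # (the suffix list is an assumption; a real tool would need a fuller set)
    text = re.sub(r"(\w) '(re|ve|ll|d|s|t|m)\b", r"\1'\2", text)
    # collapse 4-or-more-dot ellipses (spaced or not) down to 3 dots,
    # erasing the end-of-sentence vs. mid-sentence distinction on purpose
    text = re.sub(r"\.(\s*\.){3,}", "...", text)
    return text
```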
but you can expect a reply by today, tomorrow at the latest. one of the things in my reply was my focus on the library _as_ a library, rather than as a "mere" collection of works, so you've gotten an introduction to that point right here... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080331/e7d788b1/attachment.htm From nwolcott2ster at gmail.com Mon Mar 31 13:18:52 2008 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Mon, 31 Mar 2008 15:18:52 -0500 Subject: [gutvol-d] Googles denial of service messages References: <002001c89281$d8776940$660fa8c0@atlanticbb.net> <47EFE73E.3000004@perathoner.de> Message-ID: <001501c8936c$8dfedba0$660fa8c0@atlanticbb.net> I agree they get to make the rules, which say no automated bots allowed. My random use over a period of a few hours, and I don't know what part of the site they are objecting to my using, was certainly not a bot. And as Bowerbird says conversing with google can be difficult if not impossible. Perhaps if i clicked on some of their ads I would get better treatment. Maybe the "full text" key search bothered them. As to the "they get to make the rules" argument. Yes they should abide by THEIR rules. However they have contracted with various libraries to allow books to be scanned onto their site, and part of the reasons they were allowed into these libraries was that they would make out of copyright books available to the public. I don't think they ever said to Harvard college that no more than ten books could be downloaded per week/day/year by the public when they pitched their agreement. And also they make no guarantee of the "usefulness" of their scans which are often pretty bad.
And what is worse, Google offers inducements to use their site such as gmail, "my library", toolbars, etc. And I have never had a denial of service message for using google's general search engine. Obviously they want me to use that as much as I can. And I never subscribed to any "rules of service" agreement. My reason for changing my IP was only that I was being unfairly hit by google with no opportunity to explain I was not a bot. If they were more congenial there would be no problem. I would note that some libraries have refused to do business with google, and are using Internet Archive instead. And Internet Archive will accept your own scans of public domain works "in the highest resolution you can provide" and make them available to the public without restrictions. But thanks for the tip on the IP shifts. nwolcott2 at post.harvard.edu ----- Original Message ----- From: "Marcello Perathoner" To: "Project Gutenberg Volunteer Discussion" Sent: Sunday, March 30, 2008 2:17 PM Subject: Re: [gutvol-d] Googles denial of service messages > Norm Wolcott wrote: > > > I hope some of you experts out there can offer some suggestions to > > fighting back with Uncle Google. Of course google will be monitoring > > my email here as well, but maybe I should refer to them as "foogle" > > or maybe you have a better suggestion. nwolcott2 at post.harvard.edu > > First, this is no denial of service. Google is a commercial enterprise, > they are offering this service at a considerable cost, so they get to > make the rules. If you don't like Google, don't use it. > > Second, the only IP Google can see is your router's / modem's IP. The > configuration of your internal LAN is completely irrelevant. Your best > bet is to power cycle your router / modem to get a new IP from your > provider.
> > > > -- > Marcello Perathoner > webmaster at gutenberg.org > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From Bowerbird at aol.com Mon Mar 31 15:50:58 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 31 Mar 2008 18:50:58 EDT Subject: [gutvol-d] the living room of the project gutenberg library Message-ID: al said: > The library itself would be even neater, > and you'd be paying a compliment to Michael, > if you produced some (more) books. um, well gee, thank you for the suggestion, al... but i think i can decide the best use of my time, and how i will pay my compliments to michael... i do appreciate your _thoughtful_kindness_ in generating the suggestion, however. thanks... i'm extremely grateful to all of the people who digitize books, for project gutenberg and for other projects. they are doing a great service... my energy, however, is better devoted to the question of what happens to books _after_ they have been digitized. how do we create a _cyberlibrary_, and make it more _efficient_? how can we manage the correction of errors? what kind of _viewer-programs_ do we need? what _conversion-tools_ do people require? what kind of other tools should we give 'em? what is the ecosystem in which the e-texts exist, in relation to themselves and to the world at large, and how do we facilitate it? how do we enable users to _remix_ e-texts? and, of course, as i've made it clear by now, how can we make our digitization workflows more efficient, and what tools do we need? i could add little to the thousands of people who are capable of digitizing pages at d.p. on the other hand, those thousands are not capable of doing the things that i can do... they gain their power from their numbers, and it is a remarkable power that they have, and i cheer them loudly for the contribution.
but i gain my power from my unique skills, and it is a remarkable power that i have... programmers seem to be scarce in these parts... and have you seen how the d.p. people are approaching the analysis of their own data? it's very apparent to me that they need help. so i'm showing them how to do that analysis. nobody else seems capable of showing them. > Consider this a challenge i've developed my own challenges, thanks. :+) if i put the big one in a phrase, i want to develop a tool that will suck up the results of o.c.r. and -- after presenting some questions to the user so as to resolve any ambiguities -- then spit out a nicely-finished copy of the book as digital text, suitable for mounting for "continuous proofing". so asking me to do 25 books now -- manually -- is a bit like asking henry ford to stop building his assembly-line and put together 25 cars manually. if _my_ "assembly-line" works, al, i will eventually create _25_thousand_ e-texts, maybe 25 million. or maybe the guy that follows me will. or maybe the guy that follows him. or maybe it'll be google. at any rate, whoever makes that assembly-line will put most of you hand-crafters out of business. but hey, some people still build their cars by hand. > --divert some of that time and energy > you use in critiquing DP see, i don't really think that's a good idea at all. there's a reason i'm spending my time that way. and i think it's _vital_and_imperative_ for me to continue to spend my resources critiquing d.p. the reason is because d.p. is squandering what i believe to be an extremely important resource, namely the good will of well-meaning volunteers. by subjecting proofers to a _massively_inefficient_ workflow, d.p. burns them out unnecessarily, and chases them away, causing long-term damage. i know some people might not agree with me on it, but in _my_ view, this is *the* worst problem that confronts the world of volunteer digitization which michael hart created with love so many decades ago.
so i'm doing my best to combat that *worst*problem*. (i almost never use *bold*, but there you have it, al...) > and produce 25 ebooks, their titles not currently in PG, > by yourself, outside of DP, starting from real books of, > say, 250 pages or more each (not scansets gee, al, you mean you don't consider those scan-sets to be "real books"? really? they came from _libraries_... or is it that you think "real men" scan p-books themselves? ;+) > from Internet Archive, Google Books, or similar), personally, i think all of the e-books that are _not_ solidly connected with a scan-set from one of those major scanning projects will eventually be neglected, in favor of digitizations that _can_ be traced to them. the project gutenberg e-texts will be very difficult to compare visually with the major scan-sets, simply due to the fact that you've rewrapped the lines, and thus they'll come to be seen as _unnecessarily_unreliable_, in favor of versions which did _not_ re-wrap the lines. but even digitizations which did _not_ re-wrap lines will be discarded if they can't be linked to a scan-set that can be readily summoned from the big projects. and even if you provide the actual scans that you used, people won't trust you, because "who the heck are you?" if you look around cyberspace even at this early stage, for the classic books there are so many different e-texts floating around that it has become a nightmare to know exactly what you are dealing with, and the problem will only get worse. in order to make things uncomplicated, people will demand that an e-text be closely associated with a scan-set from one of the major scanning projects, and demand that the association can be confirmed easily, by a simple visual comparison of the text with the scans... so, to my mind, the _only_ raw content to use is stuff that comes "from internet archive, google books, or similar"... and that's why i'm glad d.p. is using these more and more.
i'd consider it an absolute waste of my time to scan a book; if someone else wants to do it, fine... but i would not do it, not unless there were some book that i just _had_ to have, and all of the books in that category for me are post-1923. so p.g. couldn't use them anyway... > using your tools and techniques, and have them > posted in PG, by the end of 2008. again, you seem to want me to go fishing. ok, but i, on the other hand, want to teach people to fish... i want to design and build boats, weave fishing nets, study aquaculture, and do research that improves our ecosystems, so we can have our fish and eat them too, because eating fish and fishing wisely makes us smart. i don't have anything against the people who "only" want to _go_ fishing. indeed, i'm trying to help 'em. and i believe i should continue trying to do just that. it's _really_ the very best use of my time and talent... but once again, al, i _do_ appreciate your suggestion... so if you have more of them, please keep 'em coming. or if you think there's more that i should consider, or know about your reason for _this_ suggestion, tell me... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080331/b1f8dd11/attachment-0001.htm From Bowerbird at aol.com Mon Mar 31 22:18:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 1 Apr 2008 01:18:42 EDT Subject: [gutvol-d] stopping perpetuity -- harder than it looks! Message-ID: take a look at the project page for iteration#6 of "planet strappers": > http://www.pgdp.net/c/project.php?id=projectID47dfd4f82feae it's chugging along, and about half of the pages done have a "diff"... that's right, you heard me correctly, about _half_ the pages!
:+) "but how can that be?", you might be asking. "already these pages went through 5 rounds, and there are _still_ changes being made?" yep. sure are. not _corrections_, mind you. just "changes"... meaningless changes... every last one of them meaningless... most of them having to do with ellipses. and these changes appear to have been done by new proofers (who else would tackle a project that has been in the rounds a half-dozen times?) who don't know the rules. (for example, they're replacing typos, ones where a note had been left.) heck, one (or more) is even putting spaces _between_ the ellipse dots! right after carlo, in a forum thread, said he had never seen that before. (but -- amazingly -- in strict accordance with the p.g. f.a.q. on ellipses, which has to be one of the most brain-dead p.g. rules devised thus far. spaces between the dots of an ellipsis will wreak havoc on any rewrap.) throw in a couple of runarounds on end-line hyphenates as well, with some people inserting hyphens or asterisks, and others removing 'em, and you've got one tasty "error-injection" stew boiling in your pot... this is crazy. i mean, it's an excellent demonstration of what will happen when you have "rules" that are interpreted and reinterpreted differently all the time, and confusing to boot... there are currently _several_ threads running in the d.p. forums dealing with ellipse confusion: > http://www.pgdp.net/phpBB2/viewtopic.php?t=31237 > http://www.pgdp.net/phpBB2/viewtopic.php?t=30521 further, the proofers doing these changes don't seem to realize five rounds of proofers have checked these pages before them. ("and just think, every _one_ of them missed _all_ these ellipses... on one page after the next... really very surprising, that, isn't it?") _hours_ of proofer time were spent bringing you this conclusion. just on this iteration. so far. and it ain't done. but what the heck, it's just _proofer_time_... and that ain't worth as much as peanuts...
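The complaint about meaningless diffs suggests an obvious filter: canonicalize ellipsis spacing and end-line hyphenate markup before comparing two rounds, so that only substantive corrections surface. A minimal sketch in Python; the function names and normalization rules here are illustrative assumptions, not DP's actual guidelines or tooling:

```python
import difflib
import re

def canonicalize(line: str) -> str:
    """Reduce a proofed line to a form where ellipsis-spacing and
    end-line-hyphen style differences disappear."""
    # ". . ." or "...." (any run of 3+ dots, spaced or not) -> "..."
    line = re.sub(r"\s*\.(?:\s*\.){2,}", "...", line)
    # end-line "word-*" vs "word-" -> "word-"
    line = re.sub(r"-\*?$", "-", line)
    return line.rstrip()

def substantive_diffs(round_a: list[str], round_b: list[str]) -> list[str]:
    """Diff two rounds after canonicalization; cosmetic changes vanish."""
    a = [canonicalize(l) for l in round_a]
    b = [canonicalize(l) for l in round_b]
    return [d for d in difflib.unified_diff(a, b, lineterm="")
            if d.startswith(("+", "-"))
            and not d.startswith(("+++", "---"))]

r5 = ["he paused . . . and went on.", "some inconven-*"]
r6 = ["he paused.... and went on.", "some inconven-"]
print(substantive_diffs(r5, r6))  # -> [] : every change was cosmetic
```

Running the real round-5 and round-6 page texts through a filter like this would show how many of the "diffs" reported on the project page survive canonicalization.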
*** oh, just in case you're wondering... this iteration#6 did _not_ catch the one remaining error, on p#33. we'll have to wait for iteration#7. -bowerbird p.s. however, iteration#6 _did_ find a p-book error that everyone thus far missed... the word "inconveniencies" for "inconveniences". what a shocker! how did everyone else manage to miss that so far? oh, ok, you big spoilsport, dictionary says either one is acceptable. but still, give that proofer a blue ribbon for great eyes trying hard! -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080401/2cdd7d8c/attachment.htm