From gbuchana at teksavvy.com Tue Jul 1 20:20:47 2008 From: gbuchana at teksavvy.com (Gardner Buchanan) Date: Tue, 01 Jul 2008 23:20:47 -0400 Subject: [gutvol-d] continued confusion over at distributed proofreaders In-Reply-To: References: Message-ID: <486AF40F.7050606@teksavvy.com> Bowerbird at aol.com wrote: > > rfrank (roger frank) said: > > If page after page goes by, does a proofer's attention fade? > > I believe it does. > Fade, maybe. Or it might not have been there to begin with. I PM'd a project in which timestamps showed that a good number of pages were proofed in less time that it would have taken to read them normally -- > 10 pages per minute or so -- let alone proofread. I would like to see a system like DP actually _introduce_ a specific known error or two into each page and not accept the page until the proofers had found and corrected it. I want the system to be able to verify that a known level of dilligence is being taken. ============================================================ Gardner Buchanan Ottawa, ON FreeBSD: Where you want to go. Today. From hyphen at hyphenologist.co.uk Tue Jul 1 23:04:15 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Wed, 2 Jul 2008 07:04:15 +0100 Subject: [gutvol-d] continued confusion over at distributed proofreaders In-Reply-To: <486AF40F.7050606@teksavvy.com> References: <486AF40F.7050606@teksavvy.com> Message-ID: <002701c8dc09$7c543950$74fcabf0$@co.uk> Gardner Buchanan wrote > Bowerbird at aol.com wrote: > > > > rfrank (roger frank) said: > > > If page after page goes by, does a proofer's attention fade? > > > I believe it does. > > Fade, maybe. Or it might not have been there to begin with. > I PM'd a project in which timestamps showed that a good number > of pages were proofed in less time that it would have taken > to read them normally -- > 10 pages per minute or so -- let > alone proofread. > I would like to see a system like DP actually _introduce_ a > specific known error or two into each page and not accept > the page until the proofers had found and corrected it. I > want the system to be able to verify that a known level of > dilligence is being taken. I use a orogrammers editor for the same error on multiple pages. This allows me to edit several dozen pages/chapters at a time. When I find a repeating error and I use "replace all occurances" facility with regular expressions for the tricky changes. Difficult to do with DPs single page system. Dave Fawthrop. From Bowerbird at aol.com Wed Jul 2 00:10:58 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Jul 2008 03:10:58 EDT Subject: [gutvol-d] continued confusion over at distributed proofreaders Message-ID: gardner said: > I want the system to be able to verify that > a known level of dilligence is being taken. the proof is in the pudding. the proofers do an excellent job. if you examine it closely, as i have, you will be continually surprised how well they do. they couldn't do that if they weren't executing with "dilligence". they aren't perfect, but they do accumulate to it quite rapidly. and, frankly, it's just ridiculous for someone like roger frank to imply that the proofers are not paying sufficient attention. if individual proofers wanted to be subjected to injected errors, and informed when they weren't proofing up to par, i would be in favor of that... but it would have to be a _voluntary_ system... to force it on people, who are _volunteers_, would be unseemly, i would think, _especially_ since they _are_ doing such a fine job. (if they were doing a crappy job, that might be another matter...) the problem is not the proofers. the problem is in the workflow. > I use a orogrammers editor for the same error on multiple? pages. > This allows me to edit several dozen pages/chapters at a time. > When I find a repeating error and I use "replace all occurances" > facility with regular expressions for the tricky changes. "programmer's" and "occurrences". are you doing this on purpose? ;+) > Difficult to do with DPs single page system. right. and that's just one of many problems with their workflow... whenever an error is found, a book-wide search should be done to see if that error is systematic, and -- if so -- fixed throughout. but that capability isn't baked into their infrastructure. even worse is the fact that even if you _know_ there is an error on a specific page, you can't just go in and fix it. that's one big killer. but there are dozens of such d.p. inadequacies, so discussing them hit-and-miss is like swatting individual flies on a hot summer night. -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080702/d9cf47f9/attachment.htm From Bowerbird at aol.com Wed Jul 2 03:35:36 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Jul 2008 06:35:36 EDT Subject: [gutvol-d] continued confusion over at distributed proofreaders Message-ID: dave said: > When I find a repeating error and I use "replace all occurances" > facility with regular expressions for the tricky changes. sorry, i spaced on this in my last message. i meant to say that this is one of the descriptions of preprocessing -- to use programmed search routines (a la reg-ex) to locate flaws, and fix 'em en masse, usually after a very-quick consult of the scans (e.g., verifying "fagade" to "facade" 4 times doesn't take much time)... indeed, it might be the essence of preprocessing, under a microscope. (the main concept lacking is explicit focus on _book-wide_ process; it's more than just global changes; it's that some phenomenon _only_ emerge at the book-level, like whether a word should be hyphenated, or whether a name is high-frequency enough that we assume it's ok.) and what is phenomenal, what d.p. has no way of even _knowing_, since they haven't done the research that i've performed, is that it is amazing how _few_ routines it takes to move o.c.r. to high quality... this is _not_ hard to do; on the contrary, it's _so_ easy that it's crazy! (that's why i keep banging away on this topic; it's low-hanging fruit.) in a clean book, with clear typography, and relatively simple text, the o.c.r. will be highly accurate to begin with, and the flaws that you will preprocess away will be highly predictable as well, meaning you will start with very accurate text, even before line-by-line proofing... at that point, the only reliably big chunk of errors are stealth scannos, and they generally boil down to a handful in a book, at the very most. (the other big class: flecks causing flawed but harmless punctuation.) so you do the o.c.r. and you preprocess, and boom, you've got text that's clean up in the nine nine nine nines, and you're just _starting_. -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080702/d8a3a78a/attachment.htm From grythumn at gmail.com Wed Jul 2 05:10:12 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Wed, 2 Jul 2008 08:10:12 -0400 Subject: [gutvol-d] continued confusion over at distributed proofreaders In-Reply-To: <002701c8dc09$7c543950$74fcabf0$@co.uk> References: <486AF40F.7050606@teksavvy.com> <002701c8dc09$7c543950$74fcabf0$@co.uk> Message-ID: <15cfa2a50807020510x364c83a1o310f3972964fbf6a@mail.gmail.com> On Wed, Jul 2, 2008 at 2:04 AM, Dave Fawthrop wrote: > I use a orogrammers editor for the same error on multiple pages. > This allows me to edit several dozen pages/chapters at a time. > When I find a repeating error and I use "replace all occurances" > facility with regular expressions for the tricky changes. > > Difficult to do with DPs single page system. There are easy tools to do that on the front end (prep, PM) and backend (PP) and ways for a PM to do it while in the rounds (harder, but possible). There are also a lot of new tools introduced with Wordcheck. R C From gegut at edwardjohnson.com Wed Jul 2 06:35:33 2008 From: gegut at edwardjohnson.com (G. Edward Johnson) Date: Wed, 2 Jul 2008 09:35:33 -0400 (EDT) Subject: [gutvol-d] Open source and the Kindle Message-ID: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> http://news.cnet.com/8301-13505_3-9982318-16.html?tag=nefd.top This week I tried downloading Jane Austen's Northanger Abbey from Project Gutenberg... The content is free. But it's not pretty. Line breaks aren't formatted for the Kindle, making the normally exceptional Kindle-reading experience...much less exceptional. For $1.60, I can have that exact same book with everything pre-formatted for me. He does seem to confuse open source with the public domain, but otherwise, it seems like a valid complaint. Not sure if it is PG's problem for having the linebreaks, or Kindle's problem for not doing a decent job of un-wrapping. Edward. http://edwardjohnson.com/ From marcello at perathoner.de Wed Jul 2 08:16:08 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 02 Jul 2008 17:16:08 +0200 Subject: [gutvol-d] Open source and the Kindle In-Reply-To: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> References: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> Message-ID: <486B9BB8.1060909@perathoner.de> G. Edward Johnson wrote: > http://news.cnet.com/8301-13505_3-9982318-16.html?tag=nefd.top > > This week I tried downloading Jane Austen's Northanger Abbey from Project > Gutenberg... The content is free. But it's not pretty. Line breaks aren't > formatted for the Kindle, making the normally exceptional Kindle-reading > experience...much less exceptional. For $1.60, I can have that exact same > book with everything pre-formatted for me. > > > > He does seem to confuse open source with the public domain, but otherwise, > it seems like a valid complaint. Not sure if it is PG's problem for > having the linebreaks, or Kindle's problem for not doing a decent job of > un-wrapping. He should have complained about the cretinous design of the kindle, which does not read HTML. What would it have cost to port WebKit or Gecko the the kindle? Heck, even my cellphone groks HTML. We have Northanger Abbey in both HTML and plucker for pleasurable reading on less cretinous devices. -- Marcello Perathoner webmaster at gutenberg.org From eve-news at shaw.ca Wed Jul 2 07:39:18 2008 From: eve-news at shaw.ca (Eve M. Behr) Date: Wed, 02 Jul 2008 08:39:18 -0600 Subject: [gutvol-d] Open source and the Kindle In-Reply-To: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> References: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> Message-ID: On Wed, 02 Jul 2008 09:35:33 -0400 (EDT), "G. Edward Johnson" wrote: >This week I tried downloading Jane Austen's Northanger Abbey from Project >Gutenberg... The content is free. But it's not pretty. Line breaks aren't >formatted for the Kindle, making the normally exceptional Kindle-reading >experience...much less exceptional. For $1.60, I can have that exact same >book with everything pre-formatted for me. Have you tried downloading this title from http://manybooks.net which seems to function as a mirror for Gutenberg. It's where I go to get Gutenberg titles for my Palm Pilot and it comes in a properly wrapped form that flows onto my screen correctly. Eve M. Behr EveB on DP ebehr at shaw.ca From sly at victoria.tc.ca Wed Jul 2 10:04:37 2008 From: sly at victoria.tc.ca (Andrew Sly) Date: Wed, 2 Jul 2008 10:04:37 -0700 (PDT) Subject: [gutvol-d] Open source and the Kindle In-Reply-To: References: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> Message-ID: On Wed, 2 Jul 2008, Eve M. Behr wrote: > Have you tried downloading this title from http://manybooks.net which > seems to function as a mirror for Gutenberg. It's where I go to get > Gutenberg titles for my Palm Pilot and it comes in a properly wrapped > form that flows onto my screen correctly. > Yes, manybooks is interesting. They do a good job at offering PG texts in different formats for people. I only have two caveats. They appear to take the "lowest common denominator" file, that is the plain ascii text file as their basis, possibly losing some information in the process. Also because texts are converted automatically, there are sometimes problems with word-wrap happening where it should not. Andrew From dlowry8 at comcast.net Wed Jul 2 10:23:41 2008 From: dlowry8 at comcast.net (Douglas Lowry) Date: Wed, 2 Jul 2008 13:23:41 -0400 Subject: [gutvol-d] Open source and the Kindle In-Reply-To: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> References: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> Message-ID: <001201c8dc68$60013320$6400a8c0@dlowry> For $1.60? No, everything pre-formatted for free. Try the layout of Northanger Abbey at www.wctse.com. Click on 'A', then "Austen, Jane", then on "Northanger Abbey". Zero out the voluntary payment. That makes it free. Make sure you download the WCT Reader program, also free. If you want to know what you can do with this version, try the manual at http://www.wordsclosetogether.com/HowTOC.asp. Feedback is always appreciated at dlowry8 at comcast.net. Doug Lowry -----Original Message----- From: G. Edward Johnson [mailto:gegut at edwardjohnson.com] Sent: Wednesday, July 02, 2008 9:36 AM To: gutvol-d at lists.pglaf.org Subject: [gutvol-d] Open source and the Kindle http://news.cnet.com/8301-13505_3-9982318-16.html?tag=nefd.top This week I tried downloading Jane Austen's Northanger Abbey from Project Gutenberg... The content is free. But it's not pretty. Line breaks aren't formatted for the Kindle, making the normally exceptional Kindle-reading experience...much less exceptional. For $1.60, I can have that exact same book with everything pre-formatted for me. He does seem to confuse open source with the public domain, but otherwise, it seems like a valid complaint. Not sure if it is PG's problem for having the linebreaks, or Kindle's problem for not doing a decent job of un-wrapping. Edward. http://edwardjohnson.com/ From Bowerbird at aol.com Wed Jul 2 13:48:21 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Jul 2008 16:48:21 EDT Subject: [gutvol-d] Open source and the Kindle Message-ID: edward said: > it seems like a valid complaint oh, absolutely. > Not sure if it is PG's problem for having the linebreaks, > or Kindle's problem for not doing a decent job of un-wrapping. the number-one rule for having happy users is "don't blame the user". so let's not blame the kindle. (but i'll bet you a nickel that marcello did; that's what technoids do -- blame the user for any problems that arise.) if you want people to do a "decent" job of un-wrapping (of whatever), give 'em _a_tool_ that does a decent job of unwrapping (or whatever). so, the first failing of p.g. here is that it hasn't given people such a tool. but the more-important failing of p.g. here is that its e-texts are _not_ designed in a way that lets p.g. (or anyone) even _create_ such a tool, because there's no marking on the lines which should not be wrapped. so the user finds that sections of poetry have been wrapped incorrectly, as have blocks of various types (like address blocks, tables, and so on). i've described this problem _many_ times before, to no good resolution. the solution is quite simple -- just include one or more leading spaces on any line that should not be wrapped -- but nobody who could write this sensible rule into the guidelines has been smart enough to do it... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (www.tourtracker.com ?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080702/d884a898/attachment.htm From Bowerbird at aol.com Wed Jul 2 14:00:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Jul 2008 17:00:03 EDT Subject: [gutvol-d] continued confusion over at distributed proofreaders Message-ID: dave said: > > Difficult to do with DPs single page system. robert said: > There are easy tools to do that on the front end (prep, PM) > and backend (PP) dave was talking about _within_ the single-page system. > and ways for a PM to do it while in the rounds > (harder, but possible). yeah, that's what he said, it's "difficult". he could have also added that the proofer who _finds_ an error is the person who logically should be doing the search for other similar errors. while you here have the project-manager doing it, which is why _i_ said that this capability is not baked into your system. so basically, although you would like to leave some _impression_ that you have "countered" what was said, really you've done nothing but _confirm_ its accuracy... and we haven't even gotten around to social pressures (e.g., to "keep the diffs straight") that lead to the fact that project-managers almost _never_ actually do this mid-round, despite that it is "possible", as you put it... > There are also a lot of new tools > introduced with Wordcheck. again, this is mere distraction that has very little to do with the points that were raised. if you're not going to bring anything of substance to the thread, you might as well just stay silent like the rest of your d.p. counterparts. -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (www.tourtracker.com ?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080702/098173e4/attachment.htm From ajhaines at shaw.ca Wed Jul 2 15:32:59 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Wed, 02 Jul 2008 15:32:59 -0700 Subject: [gutvol-d] Open source and the Kindle References: Message-ID: <002601c8dc93$94fcd540$6401a8c0@ahainesp2400> Concerning the indenting of text to prevent unwanted wrapping - this article has been in PG's Volunteer FAQ for some years: http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_any_places_where_I_should_indent_text.3F Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Wednesday, July 02, 2008 1:48 PM Subject: Re: [gutvol-d] Open source and the Kindle edward said: > it seems like a valid complaint oh, absolutely. > Not sure if it is PG's problem for having the linebreaks, > or Kindle's problem for not doing a decent job of un-wrapping. the number-one rule for having happy users is "don't blame the user". so let's not blame the kindle. (but i'll bet you a nickel that marcello did; that's what technoids do -- blame the user for any problems that arise.) if you want people to do a "decent" job of un-wrapping (of whatever), give 'em _a_tool_ that does a decent job of unwrapping (or whatever). so, the first failing of p.g. here is that it hasn't given people such a tool. but the more-important failing of p.g. here is that its e-texts are _not_ designed in a way that lets p.g. (or anyone) even _create_ such a tool, because there's no marking on the lines which should not be wrapped. so the user finds that sections of poetry have been wrapped incorrectly, as have blocks of various types (like address blocks, tables, and so on). i've described this problem _many_ times before, to no good resolution. the solution is quite simple -- just include one or more leading spaces on any line that should not be wrapped -- but nobody who could write this sensible rule into the guidelines has been smart enough to do it... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (www.tourtracker.com ?NCID=aolmus00050000000112) ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080702/7de7ce15/attachment.htm From Bowerbird at aol.com Wed Jul 2 17:14:22 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Jul 2008 20:14:22 EDT Subject: [gutvol-d] Open source and the Kindle Message-ID: al said: > this?article has been in?PG's Volunteer FAQ for some years: sorry, i spoke "metaphorically" about "writing it into the guidelines". what i _really_ meant was _enforcing_ the policy in the actual e-texts. you know, so actual users could actually unwrap those actual e-texts. if anyone needs, i can show you actual e-texts posted in the last week where this was not done... and thousands posted in the last decade... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (www.tourtracker.com ?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080702/f2173ec4/attachment.htm From marcello at perathoner.de Wed Jul 2 17:42:11 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu, 03 Jul 2008 02:42:11 +0200 Subject: [gutvol-d] Open source and the Kindle In-Reply-To: <002601c8dc93$94fcd540$6401a8c0@ahainesp2400> References: <002601c8dc93$94fcd540$6401a8c0@ahainesp2400> Message-ID: <486C2063.7090300@perathoner.de> BB wrote: >> so let's not blame the kindle. (but i'll bet you a nickel that >> marcello did; that's what technoids do -- blame the user for any >> problems that arise.) So blaming the kindle is blaming the user? This is sub-standard thinking, even by your standards. The following is more like you. Without understanding, without having done any research whatsoever you just open your tusked snout and let out the voice of God: >> the solution is quite simple -- just include one or more leading >> spaces on any line that should not be wrapped -- but nobody who >> could write this sensible rule into the guidelines has been smart >> enough to do it... Then Al Haines wrote: > Concerning the indenting of text to prevent unwanted wrapping - this > article has been in PG's Volunteer FAQ for some years: > > http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_any_places_where_I_should_indent_text.3F That was a very unfortunate oversight. But no, No, NEVER, try to weasel your way back out: BB wrote: > sorry, i spoke "metaphorically" about "writing it into the guidelines". Aaaaarghhh! Never admit defeat, never. Wrong, wrong, wrong. Bad troll! -- Marcello Perathoner webmaster at gutenberg.org From tb at baechler.net Thu Jul 3 02:26:41 2008 From: tb at baechler.net (Tony Baechler) Date: Thu, 03 Jul 2008 02:26:41 -0700 Subject: [gutvol-d] Apertium: Open source machine translation Message-ID: <486C9B51.3010603@baechler.net> All, I'm not 100% sure exactly what this is, but I thought it might be of interest to some here who have commented on machine translations in the past. I haven't used the software and I don't know anymore than what it says below. It would be interesting to see how accurate the translation engine is. google for example can translate text but doesn't do a great job of it. [Apertium][1] is an open source shallow-transfer machine translation (MT) system. In addition to the translation engine, it also provides tools for manipulating linguistic data, and translators designed to run using the engine. At the time of writing, there are stable bilingual translators available for English-Catalan, English-Spanish, Catalan-Spanish, Catalan-French, Spanish-Portuguese, Spanish-Galician, and French-Spanish; as well as monolingual translators that translate from Esperanto to Catalan and to Spanish, and from Romanian to Spanish. There are also a number of unstable translators in various stages of development. (A [list of language pairs][2], updated daily, is available on the [Apertium wiki][3]). [1]: http://www.apertium.org [2]: http://wiki.apertium.org/wiki/List_of_language_pairs [3]: http://wiki.apertium.org/wiki/Main_Page URL: http://linuxgazette.net/152/oregan.html From Bowerbird at aol.com Thu Jul 3 11:05:20 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 3 Jul 2008 14:05:20 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 001 Message-ID: ok, here's a new series, on how to clean up text from o.c.r. distributed proofreaders calls such clean-up "preprocessing", because it's done prior to the text being sent off to proofers... i'll show you how you can use a text-editor to do the clean-up. mostly i'll be using plain-english to tell you what to search for, but sometimes i'll have to resort to reg-ex (regular expressions) for simplicity, so you should have a text-editor that does reg-ex. i'll be using "blood mountain", the latest test-book by roger frank. to kick off this series, i'll repeat a tip i offered a while back... 1. search for all lines that start with a semi-colon. in the o.c.r. from "blood mountain", there were three such lines: > "I have been shot at," Valentine Simmons replied > ; "behind my back. The men who fail are like > Her breathing increasingly grew labored, oppressed > ; a little sob escaped, softly miserable. She > The lines on Gordon's thin, dark face had multiplied > ; his eyes, in the shadow of his bony forehead, for all three, it's pretty obvious the semicolon belongs on the previous line, and merely needs to be shifted up there. so, in my text-editor -- where "^c" stands for a line-end -- i just do a global change of "^c; " to ";^c", and step through the 3 occurrences and approve each one individually. done. -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080703/fdfbaad8/attachment-0001.htm From schultzk at uni-trier.de Fri Jul 4 01:17:38 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Fri, 4 Jul 2008 10:17:38 +0200 Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 001 In-Reply-To: References: Message-ID: Hi Bowerbird, As a general rule almost all punctuation in english should not be at the beginning of a line. Except quote marks and opening brackets! Dashes I guess are O.K. regards Keith. Am 03.07.2008 um 20:05 schrieb Bowerbird at aol.com: > ok, here's a new series, on how to clean up text from o.c.r. > > distributed proofreaders calls such clean-up "preprocessing", > because it's done prior to the text being sent off to proofers... > > i'll show you how you can use a text-editor to do the clean-up. > mostly i'll be using plain-english to tell you what to search for, > but sometimes i'll have to resort to reg-ex (regular expressions) > for simplicity, so you should have a text-editor that does reg-ex. > > i'll be using "blood mountain", the latest test-book by roger frank. > > to kick off this series, i'll repeat a tip i offered a while back... > > 1. search for all lines that start with a semi-colon. > > in the o.c.r. from "blood mountain", there were three such lines: > > > "I have been shot at," Valentine Simmons replied > > ; "behind my back. The men who fail are like > > > Her breathing increasingly grew labored, oppressed > > ; a little sob escaped, softly miserable. She > > > The lines on Gordon's thin, dark face had multiplied > > ; his eyes, in the shadow of his bony forehead, > > for all three, it's pretty obvious the semicolon belongs on > the previous line, and merely needs to be shifted up there. > > so, in my text-editor -- where "^c" stands for a line-end -- > i just do a global change of "^c; " to ";^c", and step through > the 3 occurrences and approve each one individually. done. > > -bowerbird > > > > ************** > Gas prices getting you down? Search AOL Autos for fuel-efficient > used cars. > (http://autos.aol.com/used?ncid=aolaut00050000000007) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/84d1972e/attachment.htm From Bowerbird at aol.com Fri Jul 4 02:55:27 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jul 2008 05:55:27 EDT Subject: [gutvol-d] happy bird-day Message-ID: happy birthday to project gutenberg! thank you michael! thank you, anonymous grocery-chain marketers, for printing the declaration of independence on those shopping bags! -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/ed5c744c/attachment.htm From Bowerbird at aol.com Fri Jul 4 03:18:14 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jul 2008 06:18:14 EDT Subject: [gutvol-d] World eBook Fair July 4-August 4 Message-ID: > A Million Plus Books Free for the Taking! > July 4 2008 you know, in years past i've scoffed at that number. and i would continue to scoff at it this year, because let's face the facts, there are a lot of duplicates there, so the number is vastly inflated. however, just the other day, i was reading a piece that a group of publishers is now reporting they have found a truly massive number of scanned textbooks online... let's see if i can dig it up... > http://chronicle.com/free/2008/07/3623n.htm it says: > the Association of American Publishers hired an outside law firm > this summer to scour the Web for illegally offered textbooks. > Already the firm has identified thousands of instances of > book piracy and has sent legal notices to Web sites hosting the files > demanding that they be removed. The group is looking for all types of > books, though trade books and textbooks, which generally have high > price tags, are the most frequent books offered on peer-to-peer sites. > > "In any given two-week period we found from 60,000 files > all the way up to 250,000 files," said Edward McCoyd, > director of digital policy for the publishing association. so maybe there really _are_ a lot of scanned books out there. or maybe this is just one of those scary numbers that the corporations like to throw out, to make it sound like they're losing scads of money... indeed, the website mentioned -- textbook torrents dot com -- says: > "There are very few scanned textbooks in circulation, and > that's what we're here to change," says a welcome message > on the Textbook Torrents site. "Chances are you have some > textbooks sitting around, so pick up a scanner and start scanning it!" they actually only declare 5000 scanned textbooks so far, which is -- shall we say -- a far cry from the 60,000-250,000 the industry claims to have found. but hey, 5000 is better than nothing, isn't it? and boy, you have to love their "start scanning" attitude! :+) with the book industry escalating its attempts to rip off their customers, maybe more and more people will soon be "picking up" their scanners... and maybe by next year, there really _will_ be a million books out there... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/a39873c2/attachment.htm From Bowerbird at aol.com Fri Jul 4 03:22:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jul 2008 06:22:03 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 001 Message-ID: keith said: > As a general rule almost all punctuation in english should not be > at the beginning of a line. Except quote marks and opening brackets! you're absolutely right, keith. which means we could formulate a reg-ex that'd search for all of them in one pass, if we wanted to. but i've found it's better to do them one-at-a-time. first, that helps focus in your attention much better. second, the actions required for each one often vary, between them, but are consistent within themselves. this will become clear as we step through all of them. third, it's just kind of interesting (to me, anyway) to see just how many instances turn up for each mark. > Dashes I guess are O.K. we'll separate single-dashes from double-dashes. single-dashes at the beginning of a line are usually scanning mistakes, since single-dashes are typically hyphens used in hyphenated words, so they will be located at the _end_ of the line, and not at the start. em-dashes -- or double-dashes to an o.c.r. app -- are however quite often found at the start of a line. so a single-dash at the start of the line is probably a double-dash improperly recognized by the o.c.r. indeed, of the 22 cases of this in "blood mountain", all but one of 'em should have been a double-dash. so that will be a global change from "^c-" to "^c--", which is different from the global change for "^c;", and thus is an example of what i mentioned above, where i said the remedy varies for different marks... also, in terms of the "focus" point, when i know that i'm now looking for a dash at the beginning of a line -- and nothing else -- i'm more likely to spot it when -- as actually occurs on p#346 of "blood mountain -- > http://www.z-m-l.com/go/mount/mountp346.html the o.c.r. _missed_ the dash at the start of the 4th line. however, since i was looking at the second dash there, and my attention was focused on that particular mark, i was alerted to the fact that there should have been two instances highlighted on that page, so i caught it. whenever you do preprocessing, you have to be alert for errors that are on the periphery of your attention. *** d.p. has an _extremely_stupid_ policy on em-dashes, where they will bring them up from the start of a line to the end of the previous line, and then will _also_ bring up a word from that next line as well, and they have no spaces on either side of the em-dash, meaning they often have very long lines followed by short ones. (of course, this also happens with their dehyphenation.) while it's easy enough for me to change "--" to " -- ", thus enabling more-esthetic linewrapping to happen -- it doesn't matter if the dash is at the end of one line or the start of the next, it means the same darn thing -- i wish they'd just leave the original linebreaks alone... this unnecessary rewrapping from the original breaks is -- in the long run -- going to cause me to _jettison_ the p.g. e-texts completely, good only for proofing my re-done o.c.r., where i've retained original linebreaks... users need the _option_ of rewrapping, or of retaining the linebreaks from the original book. users don't need to have rewrapping forced on them. -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/ca6dfbd1/attachment.htm From hyphen at hyphenologist.co.uk Fri Jul 4 06:52:54 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Fri, 4 Jul 2008 14:52:54 +0100 Subject: [gutvol-d] World eBook Fair July 4-August 4 In-Reply-To: References: Message-ID: <002401c8dddd$483c6e50$d8b54af0$@co.uk> Bowerbird at aol.com wrote >> A Million Plus Books Free for the Taking! >> July 4 2008 >you know, in years past i've scoffed at that number. >and i would continue to scoff at it this year, because >let's face the facts, there are a lot of duplicates there, >so the number is vastly inflated. If you *read* mh's post you will note that the claim was 1,000,000 Books whereas the count was 1,210,000, and increasing. 20% for duplicates seems reasonable to me. Dave Fawthrop -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/63b568b2/attachment-0001.htm From Bowerbird at aol.com Fri Jul 4 10:21:41 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jul 2008 13:21:41 EDT Subject: [gutvol-d] World eBook Fair July 4-August 4 Message-ID: dave said: > If you *read* mh?s post you will note that > the claim was 1,000,000 Books whereas > the count was 1,210,000, and increasing. i did read it. > 20% for duplicates seems reasonable to me. i put the number of duplicates at more like 50%. here's the inflated count: > ? ~100,000+ from Project Gutenberg > ? ~500,000+ from The World Public Library > ? ~450,000? from The Internet Archive > ? ~160,000? from eBooks About Everything > ---------- > ~1,210,000+ Grand Total as of July 1, 2008 here's a much more reasonable guess: > ? ~050,000 from Project Gutenberg > ? ~200,000 from The World Public Library > ? ~250,000? from The Internet Archive > ? ~050,000? from eBooks About Everything > --------- > ~550,000 Grand Total as of July 1, 2008 multiple copies of a project gutenberg e-text count as 1 book, just 1, no matter how many libraries make a copy, by any rational census... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/0c2fb7cf/attachment.htm From Bowerbird at aol.com Fri Jul 4 11:06:54 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jul 2008 14:06:54 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 002 Message-ID: ok, since keith brought it up, here are the 22 cases in "blood mountain" of a single-dash at a linestart... 2. search for all lines that start with a semi-colon. the dash on the pagenumber for page 125 was a speck, so it was the sole exception to a global change from "-" to "--". and, as i mentioned earlier, i noticed the absence of one em-dash at the beginning of an o.c.r. line, so i added it... 27 corrections made, on 2 routines... by the way, you can get a copy of the o.c.r. text here: > http://www.pgdp.net/c/project.php?id=projectID4865040815e01 if you'd like to follow along and verify my statements. i'll be back tomorrow with the next tip in this series... -bowerbird -- singledash at linestart -- On the occasions when he was too drunk to drive -not over often--a substitute was quietly found "is always made a target for the abuse of the -the thoughtless. But he usually comes to the -----File: 073.png -Clare dangerously ill ... a question of dying, eighty dollars in his pocket. He had another vision -of Simmons; it was two hundred and fifty dollars nuto passage of a symphony; "but it's all one to me -there's nothing else they can take; I'm free, free Delaying his expression of gratitude to the priest -he could stop on his return with trout--Gordon business with it, a ... a gun store,--I like guns, -here in Greenstream. And I'd sharpen scythes, together, and he made a list of what I would need -files and vises and parts of guns. If I mailed my in the dark house.... He shut his eyes for a -[125] "Wouldn't I?" she exclaimed; "oh, wouldn't I? -smart crowds and gay streets and shops on fire laboriously polite, "the next time--I'll do it! -when I'm in Stenton again I'll bring you a pair from the rough, minor forms into the bigger sweep -it was like a great, green bed half filled with a got penetrating as a musket. Rose is just like her -she's all taffy now on that young man, but in a "whenever you like. Of course it's a fine article -all strung on gold wire. I won't be surprised a little from his blood. She demanded a great deal -a man could never return. He bitterly cursed his was driving, and by her side ... Lettice! Lettice -riding over the rough field, over the dark stony -----File: 267.png -what man had not?--but this was different; this man with his crop a failure on the field like, well -we'll say, Cannon does, with a note in my hand inhibition had arisen in the negotiations -he had destroyed him with Gordon's own blindness, thousand dollars to get them, and they're worth -that," he flung them with a quick gesture into the Stenton stage," he shrilled; "and I made out to ask -you can take it or leave it--if you'd drive again? ... others ... new courage, example of bigness -Why! what's the matter with you, Makimmon? ====================================== plus replacement of 1 missing em-dash on page 346 "Have you got the options?" Entriken demanded "all them that Pompey had and you bought?" "Have you got the options?" Entriken demanded -- "all them that Pompey had and you bought?" ====================================== ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/eb32eab2/attachment.htm From Bowerbird at aol.com Fri Jul 4 11:57:12 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jul 2008 14:57:12 EDT Subject: [gutvol-d] speaking of happy bird-day Message-ID: speaking of bird-days, it's also the birthday of "alice in wonderland". meaning that it's probably a good time to report that the p.g. e-text of this wonderful story has recently had a facelift from david widger. i haven't checked out the new version, but i'm sure it's quite spiffy... so happy bird-day, alice. you sure don't look 146 years old... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/a8c398d4/attachment.htm From prosfilaes at gmail.com Fri Jul 4 18:23:30 2008 From: prosfilaes at gmail.com (David Starner) Date: Fri, 4 Jul 2008 21:23:30 -0400 Subject: [gutvol-d] World eBook Fair July 4-August 4 In-Reply-To: <002401c8dddd$483c6e50$d8b54af0$@co.uk> References: <002401c8dddd$483c6e50$d8b54af0$@co.uk> Message-ID: <6d99d1fd0807041823l6edf8ffcu1e3f6cd38e40b064@mail.gmail.com> On Fri, Jul 4, 2008 at 9:52 AM, Dave Fawthrop wrote: > If you *read* mh's post you will note that the claim was 1,000,000 > > Books whereas the count was 1,210,000, and increasing. > > 20% for duplicates seems reasonable to me. It also claimed that PG donated 100,000 books, which would indicate the duplicate count was a bit higher than that. From Bowerbird at aol.com Sat Jul 5 12:44:28 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 5 Jul 2008 15:44:28 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 003 Message-ID: and we continue with our series on routines to preprocess o.c.r. 3. search for a lowercase letter followed by an uppercase letter. > PHWTED IN THE T7NITED STATES OlfAMEEIOA > and playing him out. Come here, General JacK-son." > George Gordon MacKimmon, resting on the porch *** embedded cap correct > everywhere; Gordon had pitched the headstalFinto > by George Gordon MacKimmon from world-old *** embedded cap correct > General Jackson moved forward over the porcK. we see that 2 of the 6 presenting cases are _correct_ -- as the last name of "mackimmon" has an embedded cap, such last names a common "false alarm" with this routine, -- so we are left with 4 lines that need to be corrected... 4 more lines corrected, for a grand total of 31, on 3 routines... i'll be back tomorrow with the next tip in this series... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080705/989b7ce3/attachment.htm From hart at pglaf.org Sat Jul 5 15:54:41 2008 From: hart at pglaf.org (Michael Hart) Date: Sat, 5 Jul 2008 15:54:41 -0700 (PDT) Subject: [gutvol-d] World eBook Fair July 4-August 4 In-Reply-To: References: Message-ID: OK, it's one thing for you to trash me, I can, and have, walked away from that. But when you start trying to trash mathematics and logic, well, then something must be said. Starting with the largest collections: 1. A serious look at World Public Library would indicate the number of duplications at approximately 10%. Even if you dicounted those from different paper editions of the same title, you couldn't say 15% in reality. Then again, there are more than just formatting diffs for the .txt, .html, .pdf, etc. files in many cases. Personally, I think we will end up with 30 legitimate for all the great books, and should keep and count them all-- but even with discounting a total 10% that leaves: Starting with: 500,000+ as their grand total subtracting out 10% as duplicates: 450,000 that are not duplicates. 2. Then counting TWICE as many duplication at Internet Archive, due to those also at The World Public Library: 450,000+ as their grand total subtracting out 20% as duplicates: 370,000 that are not duplicates. Thus, the first sub-total is: 820,000 that are not duplicates. 3. The 17,000 music scores are obviously NOT dupes. Thus, the second sub-total is: 837,000 that are not duplicates. 4. Project Gutenberg Out of some 28,000 - 29,000 files, only the first 17,000 are listed by either Internet Archive/World Public Library. This leave about 12,500 not listed in the above libraries. Thus the third sub-total is: ~850,000 that are not duplicates. 5. As for the commercial eBooks, these all their their own editors, artwork, etc., though I cannot say how like an earlier paper or ebook edition they are, as far as I am informed each has its own copyright as must be listed a totally separate way by any library. Thus the fourth sub-total is: 1,010,000 that are not duplicates. 6. Internet archive is adding about 1,000 per day. Given only 20 total business days from July 4 - Aug 4-- they plan to add about 20 more books to the: ~453,000 already there as of the last business day. Unless these new items are duplicates, which I doubt is the case with the current batch, their final total will be approximately: 473,000 And we shoudl add perahps 20,000 to the above sub-total: 7. Thus our final grand total, wiping out 200,000 or so from the highest possible grand total of no more than: 1,250,000 [given all other additions] to create a "deflated total" of: 1,050,000 [or somewhere thereabouts]. Obviously there is room to argue a few percent, But unless you go more overboard on duplications than any of the librarians I have asked, it will not be all that different a grand total. The illogical examples given below want to take out any possible duplication over and over, but with only 17,000 even LISTED from PG, you could not change the grand total by much over 50,000, even if you took them ALL OUT THREE TIMES OVER. And even having taken them out TWICE, we still, and yet, have 50,000 to spare. Again, it's not exact counting, and none of our standards seem to be met by many of these books listed here, but I'll put over half these books up against the better half of what U Michigan's report claimed a month or so ago, or Google's-- or another other collection over a million. HOWEVER. . .I would LOVE to come back a year or two or three from now and hear that there are a whole million well proofread full text eBooks! Thank You!!! Give the world eBooks for 2008!!! Don't forget: Over 1.2 million eBooks starting July 4 at: http://www.worldebookfair.org Ends August 4. Michael S. Hart Founder Project Gutenberg Inventor of eBooks 100,000 eBooks easy to download at: http://www.gutenberg.org [over 28,000 eBooks] http://www/gutenberg.cc [over 75,000 eBooks] Http://gutenberg.net.au Project Gutenberg of Australia ~1640+ http://pge.rastko.net 65 languages PG of Europe ~500+ http://gutenberg.ca Project Gutenberg of Canada ~100+ http://preprints.readingroo.ms Not Primetime Ready ~387 >>> Your Project Gutenberg Site Could Be Listed Here <<< Don't forget Project Runeberg for Scandinavian languages. Blog at http://hart.pglaf.org On Fri, 4 Jul 2008, Bowerbird at aol.com wrote: > dave said: >> If you *read* mh??s post you will note that >> the claim was 1,000,000 Books whereas >> the count was 1,210,000, and increasing. > > i did read it. > > >> 20% for duplicates seems reasonable to me. > > i put the number of duplicates at more like 50%. > > here's the inflated count: >> ? ~100,000+ from Project Gutenberg >> ? ~500,000+ from The World Public Library >> ? ~450,000? from The Internet Archive >> ? ~160,000? from eBooks About Everything >> ---------- >> ~1,210,000+ Grand Total as of July 1, 2008 > > here's a much more reasonable guess: >> ? ~050,000 from Project Gutenberg >> ? ~200,000 from The World Public Library >> ? ~250,000? from The Internet Archive >> ? ~050,000? from eBooks About Everything >> --------- >> ~550,000 Grand Total as of July 1, 2008 > > multiple copies of a project gutenberg e-text > count as 1 book, just 1, no matter how many > libraries make a copy, by any rational census... > > -bowerbird > > > > ************** > Gas prices getting you down? Search AOL Autos for > fuel-efficient used cars. > (http://autos.aol.com/used?ncid=aolaut00050000000007) > From keichwa at gmx.net Sat Jul 5 18:18:38 2008 From: keichwa at gmx.net (Karl Eichwalder) Date: Sun, 06 Jul 2008 03:18:38 +0200 Subject: [gutvol-d] World eBook Fair July 4-August 4 Message-ID: "David Starner" writes: > It also claimed that PG donated 100,000 books, which would indicate > the duplicate count was a bit higher than that. It's the known nonsense Mr Hart uses to post since several years. Just ignore it. -- Karl Eichwalder From Bowerbird at aol.com Sat Jul 5 23:18:41 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 6 Jul 2008 02:18:41 EDT Subject: [gutvol-d] World eBook Fair July 4-August 4 Message-ID: michael- i'm gonna give you a night or two to sleep on it, and if you still want to stand behind that last post, i'll send my response... you let me know... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080706/0759fa34/attachment.htm From Morasch at aol.com Sun Jul 6 13:45:43 2008 From: Morasch at aol.com (Morasch at aol.com) Date: Sun, 6 Jul 2008 16:45:43 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 004 Message-ID: here's another twist on casing anomalies: 4. search for two uppercase letters followed by a lowercase letter. 3 lines presented, and were corrected as indicated: > ITwas his own home to which he returned, the > It was his own home to which he returned, the > commonplace. He saw TOPable sitting on > commonplace. He saw Tol'able sitting on > JBeggs added; "your money's tight around his neck." > Beggs added; "your money's tight around his neck." 3 more lines corrected, for a grand total of 34, on 4 routines... i'll be back tomorrow with the next routine in this series... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080706/4c553f9a/attachment.htm From Bowerbird at aol.com Sun Jul 6 23:56:35 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 7 Jul 2008 02:56:35 EDT Subject: [gutvol-d] i sent this over to the discussion listserve at openlibrary Message-ID: i sent this over to the discussion listserve at openlibrary *** the o.c.r. on the books you scan is _still_ fatally flawed! it's missing em-dashes, and probably more characters too, but i can't stand to look at it, because it turns my stomach... _when_ are you going to clear up this _significant_ problem? it's getting extremely difficult for me to shake the feeling that absolutely nobody there cares about the quality of what you do. and that's a crying shame, because so many people depend on you. -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080707/925e429e/attachment.htm From Bowerbird at aol.com Mon Jul 7 08:26:02 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 7 Jul 2008 11:26:02 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 005 Message-ID: 5. search for lines containing multiple spaces... > the trap. _{t} one line. ok. fixed. 1 more line corrected, for a grand total of 35, on 5 routines... i'll be back tomorrow with the next tip in this series... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080707/f2729feb/attachment.htm From Bowerbird at aol.com Mon Jul 7 09:46:08 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 7 Jul 2008 12:46:08 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- roadmap Message-ID: just so you know where we're going with this... when we're done with the clean-up for this book, there will be nothing but a handful of errors left that the human proofers will have to find and fix... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080707/386bca3e/attachment.htm From hart at pglaf.org Mon Jul 7 13:00:12 2008 From: hart at pglaf.org (Michael Hart) Date: Mon, 7 Jul 2008 13:00:12 -0700 (PDT) Subject: [gutvol-d] !@! Source files possible? Message-ID: ---------- Forwarded message ---------- Date: Mon, 7 Jul 2008 09:41:55 -0400 From: Vincent Terreri To: hart at pobox.com Subject: Source files possible? First of all, let me tell you that you are doing a wonderful work in this project. I have recently signed up to volunteer for proofing. Is there any way of getting a view of the scanned pages from which the following etext was developed? Title: Ritchie's Fabulae Faciles A First Latin Reader Author: John Kirtland, ed. Release Date: September, 2005 [EBook #8997] [Yes, we are more than one year ahead of schedule] [This file was first posted on August 31, 2003] Edition: 10 Language: English Character set encoding: ASCIIEnd of Project Gutenberg's Ritchie's Fabulae Faciles, by John Kirtland, ed. *** END OF THE PROJECT GUTENBERG EBOOK RITCHIE'S FABULAE FACILES *** This file should be named 7flrd10.txt or 7flrd10.zip Corrected EDITIONS of our eBooks get a new NUMBER, 7flrd11.txt VERSIONS based on separate sources get new LETTER, 7flrd10a.txt Produced by Karl Hagen, Tapio Riikonen and Online Distributed Proofreaders I would like to use the text in my Latin classes next year, but am interested in how the lines are numbered in the original text. It would help considerably in preparing text that was more user friendly to my middle school students. Anything you can do would be greatly appreciated. Please let me know if there is another address I should send this request. All the best, Vincent Vincent Terreri 703-431-7467 mobile phone 540-668-7157 home phone 801-459-3733 fax number No virus found in this outgoing message. Checked by AVG. Version: 7.5.524 / Virus Database: 270.4.5/1537 - Release Date: 7/6/2008 5:26 AM From grythumn at gmail.com Mon Jul 7 13:21:40 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Mon, 7 Jul 2008 16:21:40 -0400 Subject: [gutvol-d] !@! Source files possible? In-Reply-To: References: Message-ID: <15cfa2a50807071321q138fcd3fx9b1fa5bd81947e33@mail.gmail.com> Page images are at: http://www.pgdp.org/ols/tools/display.php?book=3f2145e242d4f&nextpage=001.png&numpages=150 The R2 output is archived somewhere at IA, but that requires human intervention to retrieve. R C On Mon, Jul 7, 2008 at 4:00 PM, Michael Hart wrote: > > > ---------- Forwarded message ---------- > Date: Mon, 7 Jul 2008 09:41:55 -0400 > From: Vincent Terreri > To: hart at pobox.com > Subject: Source files possible? > > > First of all, let me tell you that you are doing a wonderful work in this > project. I have recently signed up to volunteer for proofing. > > > > Is there any way of getting a view of the scanned pages from which the > following etext was developed? > > > > Title: Ritchie's Fabulae Faciles > > A First Latin Reader > > > > Author: John Kirtland, ed. > > > > Release Date: September, 2005 [EBook #8997] > > [Yes, we are more than one year ahead of schedule] > > [This file was first posted on August 31, 2003] > > > > Edition: 10 > > > > Language: English > > > > Character set encoding: ASCIIEnd of Project Gutenberg's Ritchie's Fabulae > Faciles, by John Kirtland, ed. > > > > *** END OF THE PROJECT GUTENBERG EBOOK RITCHIE'S FABULAE FACILES *** > > > > This file should be named 7flrd10.txt or 7flrd10.zip > > Corrected EDITIONS of our eBooks get a new NUMBER, 7flrd11.txt > > VERSIONS based on separate sources get new LETTER, 7flrd10a.txt > > > > Produced by Karl Hagen, Tapio Riikonen and Online Distributed Proofreaders > > > > I would like to use the text in my Latin classes next year, but am > interested in how the lines are numbered in the original text. It would > help considerably in preparing text that was more user friendly to my middle > school students. > > > > Anything you can do would be greatly appreciated. Please let me know if > there is another address I should send this request. > > > > All the best, > > > > Vincent > > > > > > Vincent Terreri > > 703-431-7467 mobile phone > > 540-668-7157 home phone > > 801-459-3733 fax number > > > > > No virus found in this outgoing message. > Checked by AVG. > Version: 7.5.524 / Virus Database: 270.4.5/1537 - Release Date: 7/6/2008 > 5:26 AM > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From ajhaines at shaw.ca Mon Jul 7 13:26:52 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Mon, 07 Jul 2008 13:26:52 -0700 Subject: [gutvol-d] !@! Source files possible? References: Message-ID: <000901c8e06f$caf00b60$6401a8c0@ahainesp2400> This book is at Internet Archive: http://www.archive.org/details/fabulaefacilesfi00ritcrich PDF versions are available from that page, in the "View the book" box.. Clicking on the HTTP link (at the bottom of that box) gives access to GIF, JP2 (JPEG 2000), and assorted other versions. It's probably best to ignore IA's text version in favor of PG's - IA's text files are pretty raw, to put it mildly. Al ----- Original Message ----- From: "Michael Hart" To: "The gutvol-d Mailing List" Cc: Sent: Monday, July 07, 2008 1:00 PM Subject: [gutvol-d] !@! Source files possible? > > > ---------- Forwarded message ---------- > Date: Mon, 7 Jul 2008 09:41:55 -0400 > From: Vincent Terreri > To: hart at pobox.com > Subject: Source files possible? > > > First of all, let me tell you that you are doing a wonderful work in this > project. I have recently signed up to volunteer for proofing. > > > > Is there any way of getting a view of the scanned pages from which the > following etext was developed? > > > > Title: Ritchie's Fabulae Faciles > > A First Latin Reader > > > > Author: John Kirtland, ed. > > > > Release Date: September, 2005 [EBook #8997] > > [Yes, we are more than one year ahead of schedule] > > [This file was first posted on August 31, 2003] > > > > Edition: 10 > > > > Language: English > > > > Character set encoding: ASCIIEnd of Project Gutenberg's Ritchie's Fabulae > Faciles, by John Kirtland, ed. > > > > *** END OF THE PROJECT GUTENBERG EBOOK RITCHIE'S FABULAE FACILES *** > > > > This file should be named 7flrd10.txt or 7flrd10.zip > > Corrected EDITIONS of our eBooks get a new NUMBER, 7flrd11.txt > > VERSIONS based on separate sources get new LETTER, 7flrd10a.txt > > > > Produced by Karl Hagen, Tapio Riikonen and Online Distributed Proofreaders > > > > I would like to use the text in my Latin classes next year, but am > interested in how the lines are numbered in the original text. It would > help considerably in preparing text that was more user friendly to my middle > school students. > > > > Anything you can do would be greatly appreciated. Please let me know if > there is another address I should send this request. > > > > All the best, > > > > Vincent > > > > > > Vincent Terreri > > 703-431-7467 mobile phone > > 540-668-7157 home phone > > 801-459-3733 fax number > > > > > No virus found in this outgoing message. > Checked by AVG. > Version: 7.5.524 / Virus Database: 270.4.5/1537 - Release Date: 7/6/2008 > 5:26 AM > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Tue Jul 8 00:15:39 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Jul 2008 03:15:39 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 006 Message-ID: it's very good to do an early search for garbage characters... this is a search that benefits from using regular expressions. the reg-ex is something along these lines: [\&\*\<\>\\\/\|\*\{\}\_] you'll usually also want to search for high-bit ascii characters... 6. search for lines with garbage characters, and edit as necessary. > In this manner his father, |ust such another, had > _Mr. Ottinger elected to imbibe his "straight" > _{cr}ippling, the other. A chair fell, sliding across the > harsh, lik_{t}e a discordant bell clashing in the soste- > of men; envy was perceptible, bitterness *" ... for > "A 'little stroll.' *" Buckley produced a heavy > It enraged him that she was so collected; her body,* > *?217] > employed Mrs. Caley. The grea vast, indefinable peril, blacker than night, lo&ming > \ > the throes of a new piece, Mc*Ginty, and Gordon > *?330} > the trap. _{t} > "Sim," Gordon demanded sharply, "_{you} never > of wrath, his arm rose, with a finger indicating the* > "'Give it to him,' *" Gordon repeated thinly. "I > "?' dam' idiot," Gordon mumbled, "if I die out 18 hits. of note is that the second line -- "elected to imbibe" -- had also been separated from its paragraph, on both ends, so was reunited. in a similar vein, the seventh line -- "she was so collected" -- was improperly joined to the paragraph above, so i added a blank line. moreover, the line with "\" also involved a badly broken paragraph. we'll do a check on paragraphing later, but when you see a glitch, you should correct it right away, even if it's not the type you were "looking for" at the current moment. the most efficient time to do a fix is right when you see an error. don't wait until later to fix it. 22 more lines corrected, for a grand total of 57, on 6 routines... i'll be back tomorrow with the next routine in this series... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080708/228933ee/attachment.htm From Bowerbird at aol.com Wed Jul 9 02:46:24 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Jul 2008 05:46:24 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 007 Message-ID: 7. search for letter-number or number-letter combos... (in reg-ex terms, this will be "[a-z][0-9]" or "[0-9][a-z]".) letter-number > PHWTED IN THE T7NITED STATES OlfAMEEIOA > 12Q4J > [247J > P25J number-letter > PHWTED IN THE T7NITED STATES OlfAMEEIOA > fl04] > 12Q4J > P25J 4 of each, with 3 overlapping, for a total of 5 unique lines... of these 5 unique lines, 4 of them involved _pagenumbers_. > PHWTED IN THE T7NITED STATES OlfAMEEIOA > fl04] > 12Q4J > [247J > P25J 5 more lines corrected, for a grand total of 62, on 7 routines... i'll be back tomorrow with the next tip in this series... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080709/62147f10/attachment.htm From Bowerbird at aol.com Wed Jul 9 03:33:45 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Jul 2008 06:33:45 EDT Subject: [gutvol-d] how to digitize a book, step by step -- 001 Message-ID: i can see it's time for me to rewind and review the whole process, from the start. voila, "how to digitize a book, step by step..." first off, choose a book that you really want to get intimate with. you're gonna be doing an internal inspection of the thing's guts, so it might as well be a book you really like, or think you might... next, get a clean copy. try to get several copies, and use the cleanest. third, thumb through the book and familiarize yourself with it fully... are there illustration pages? are they numbered or unnumbered? how many pages are there in it? what is the first numbered page? how many unnumbered front-matter pages come before that page? number back from the first numbered page, hopefully back to _1_, which will be the title-page. if there's anything before the title-page, like a frontispiece, you should shuffle it so that it comes _afterward_. it's perfectly ok to shuffle unnumbered frontmatter pages, and even delete unnecessary pages (especially blank pages) so the numbers will come out right. fourth, start scanning... do a careful job. it _does_ make a difference. try to make the scans as straight as possible, so you'll get good o.c.r. also try to position 'em consistently on the scan-bed, for even margins. start scanning at the first numbered page, and set the default filename to that number. after that, the o.c.r. will increment the default filename _automatically_, which means you'll want to _skip_ unnumbered pages on this pass. you'll come back to 'em later, and name them _manually_. you'll also do the frontmatter pages, and name those manually as well... when you're done, every scan will be _named_ with its _pagenumber_, meaning you won't have to do any guesswork to know what file is what. -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080709/5d7b3797/attachment.htm From Bowerbird at aol.com Wed Jul 9 03:35:02 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Jul 2008 06:35:02 EDT Subject: [gutvol-d] how to digitize a book, step by step -- 002 Message-ID: now, let's see what happens when you _don't_ name your files wisely... i'm gonna use this book, which is currently in-process over at d.p.: > http://www.pgdp.net/c/project.php?id=projectID4870dfe646daf it's called "the cabin on the prairie"... so, i've explored this book, and i can tell you a little bit about it... the numbered pages start at page 5. there are 5 pages _before_ that page, but one of them is blank (whew), so it can (and should) be eliminated, of course. so toss the blank page. there. now we have 4 unnumbered pages, which we'll name p001-p004, and then our numbered pages start on p005. so the numbers should work fine... _except_... no, they don't. why not? well, first, because we have two illustration pages, which were unnumbered in the original p-book... one is named 066.png, and the other is 191.png... (but remember that the numbering is whack here.) they came after pages 64 and 186, respectively, so we will rename them 064a.png and 186a.png, of course. and then, to preserve the recto/verso in the book, we'll introduce blank page-scans as 064b and 186b. that should fix our numbering... but no, it still doesn't work. why not? we have to dig a little deeper... well, because pages 176 and 177, who are known as 180.png and 181.png in the badly-named scan-set, are repeats of 178.png and 179.png. oh-oh, a glitch. (i don't make this up. if i did, you wouldn't believe it. go check for yourself, and you will see it for yourself.) this bug -- accidentally scanning a page-spread twice -- is relatively common. face it, human beings make errors. and it's much better to scan a spread twice than not at all. (which is another relatively-common error.) one of the reasons you want to name your scans wisely is so you can _catch_ these errors as soon as possible... when you are using an intelligent filenaming convention, and external filename mismatches internal pagenumber, you immediately know that you've got an error on the line. the content provider here was using opaque filenames, so he didn't have a clue that he had made that mistake... anyway, so i tossed out the duplicates, and now it works. i've uploaded the scans here, so you can see how it works: > http://z-m-l.com/go/cabin you see i also gave the files full-fledged names, starting with "cabin" as my unique filename used exclusively for this book, and then with the "p" prefix -- for "page"... the unnumbered illustration pages are "cabinp064a.png" and "cabinp186a.png", and their versos are "cabinp064b.png" and "cabinp186b.png"... all other names are transparent: "cabinp###.png" for page ###. so we've got a neatly-structured scan-set, and can go to work. this is the kind of filenaming structure you want to aim for, and it's easiest if you plan ahead so the o.c.r. app will do it for you... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080709/04e479e6/attachment.htm From Bowerbird at aol.com Wed Jul 9 10:25:19 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Jul 2008 13:25:19 EDT Subject: [gutvol-d] how to digitize a book, step by step -- 003 Message-ID: another in-process book being used as a test over at d.p. is this one: > http://www.pgdp.net/c/project.php?id=projectID4873f471bb8c9& detail_level=4 it's called "the crevice". i prepared a properly-named version of it too: > http://z-m-l.com/go/crvic the first numbered page here was page 1 of chapter 1 -- p001.png -- meaning the unnumbered frontmatter pages needed to be named with another prefix. i usually use "f" -- f001.png -- for frontmatter pages... i only kept 4 of the frontmatter pages -- you have to have an _even_ number of pages in each prefix so as to retain the recto/verso mode -- so they're named "crvicf001.png" through "crvicf004.png", of course. (well, actually, f004 is an illustration, so it's a .jpg, so it's "crvicf004.jpg".) so far, in my library, i always include "c001" and "c002" files as well -- for "cover", so it's the front-cover and "hot" (i.e., linked) contents. i will frequently simply duplicate the "f001" or "p001" page for "c001", but in this book there was a scan of the cover, so i used it for c001... anyway, this is just to demonstrate how to handle multiple prefixes in the naming condition. there are lots more wrinkles that can be engineered to handle any special cases you might encounter, but for the most part the filenaming convention is very straightforward, because i intentionally designed it that way, for it to be transparent. you'll also note that this book, like the other one, has _illustrations_ on _unnumbered_pages_, this time located after pages 94 and 262... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080709/2062142b/attachment.htm From Bowerbird at aol.com Thu Jul 10 10:31:55 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Jul 2008 13:31:55 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 008 Message-ID: we shouldn't have a paragraph that starts with a lowercase letter... 8. search for a double-line-end followed by a lowercase letter... 12 cases present. the double-line-end is signified here by "//". > 'Most people go a good length before fighting with//me." > see in my ramshackle house and used up ground, is//over me." > _Mr. Ottinger elected to imbibe his "straight"//from the bottle--it was drunk... > slowly and rolled like a flash over her plastered//skin. > cern now was to get away, to take the money with//him. > sake," Otty gasped, "get to him, the town'll be on//us." > to the door; it said, "Gone fishing. Back to-//morrow." > interior which absorbed them.//fl04] > the other men would hate him; they would all want//me." > that would go twice about the neck and then hang//some." > \//he wouldn't have gone, anyway. > fell sooner and night lingered late into//morning. as you can see, this happens with _short_ lines, one or two words. all 12 of these lines were in error, so were pulled up. in addition, the "mr. ottinger" line was improperly broken _above_ itself as well, and the "\" line was a glitch, so it was deleted, and the "fl04]" line was a botched pagenumber (104), so that was corrected as well... it is worth nothing that this paragraphing check should be done _early_ in the process, because some of the later routines involve checks aimed at the _paragraph_level_ -- e.g., balanced quotes -- so it's important that the paragraphs be correct for them to work. of course, the paragraphs need to be correct for their _own_ sake, it should go without saying. since it is just as easy to ensure that they are correct at the _beginning_ of the workflow as at the _end_, you might as well do it at the beginning. 15 more lines corrected, for a grand total of 77, on 8 routines... i'll be back tomorrow with the next tip in this series... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080710/149cf091/attachment.htm From Bowerbird at aol.com Thu Jul 10 13:59:12 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Jul 2008 16:59:12 EDT Subject: [gutvol-d] that didn't take long Message-ID: well, in case you were wondering how long it would take -- once apple opened up iphone for independent developers -- for iphone e-book apps to debut, the answer is "not long". i don't think the store will "officially" open until tomorrow, but there's already news e-books have made an appearance. here's the link for "jane eyre": > http://phobos.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=284928522&mt=8 and here's "pride and prejudice": > http://phobos.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=284922530&mt=8 $.99 each. for public-domain books. boing-boing weighs in: > http://gadgets.boingboing.net/2008/07/10/iphone-app-store-sel.html -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080710/fbb6a4cc/attachment.htm From Bowerbird at aol.com Thu Jul 10 15:45:43 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Jul 2008 18:45:43 EDT Subject: [gutvol-d] how to digitize a book, step by step -- 004 Message-ID: ideally, it's best if you straighten and crop your scans, because you will get much better o.c.r. results if you do. it's probably obvious why straight lines are better, since the letters will more closely resemble the "idealized" forms. well-cropped scans also give better results, because margin-marks are not misrecognized, and you can set up separate "zones" to capture the runheads and the pagenumbers that might be located down at the bottom of any pages. that's the point of this message, that you should _scan_and_retain_runheads_. the policy at d.p. is to chop them off before the pages go in front of proofers. that's just misguided. those runheads and pagenumbers give you _bearing_ in navigating the book. they help you avoid getting things badly screwed up. they can be deleted later down in the workflow. leave them in there for now... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080710/b0dd5b88/attachment.htm From Bowerbird at aol.com Thu Jul 10 17:04:52 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Jul 2008 20:04:52 EDT Subject: [gutvol-d] how to digitize a book, step by step -- 005 Message-ID: over at d.p., someone said this: > In pre-processing, I make corrections that can be done: > (a) Without too much effort. If something's going to take a lot of time > to fix in preprocessing, it's a better use of time to have P1 fix it. > (And we're not short of P1'ers). > (b) And there is a high probability that the correction is right, > rather than introducing an error. I don't make automatic changes > in prep that are likely to introduce errors. off the top of the dome, this sounds rather reasonable... but on reflection, it's almost 180-degrees wrong. (about 150.) first, on point (a), it's almost _always_ faster to fix something in preprocessing than having it go in front of the proofers instead. at least it _should_ be. the main reason is that, in preprocessing, the computer is doing the grunt-work of _finding_ the glitches, and -- in the realm of a good text-editor or dedicated tool -- applying the fix is quite straightforward and maximally efficient. moreover, when you step through each type of glitch individually, the process of applying the fix becomes even more streamlined... on point (b), the plain fact is that almost no fix can be done "blind". this doesn't necessarily mean that you have to examine the _scan_, but it _does_ mean that you have to grok the content of the context, and the computer just can't do that. and even if you make a flock of changes without looking at each one _before_ it gets enacted, you must peruse the list of them _afterward_, just to make sure... so, how did this person get derailed into saying what they said? easy. they don't have a good tool to do "preprocessing" at d.p., so he's working under a severe handicap clouding his thinking. his blinders mean he can't see how useful preprocessing can be. a good interface for a decent o.c.r. clean-up tool is fairly simple. you need an editing capability, side-by-side with a scan viewer, and a solid means of isolating and jumping to problematic text... i programmed such a tool years ago -- called "banana cream" -- and i've decided that in the light of recent improvements, i will be releasing a stripped-down version of it to the public very soon... and as the series on "how to do preprocessing" continues, i will incorporate those routines into the program to flesh it out a bit. i could've released this program years ago -- and intended to -- but since there were a several d.p. people among my antagonists here on this listserve, i decided to hold it back instead. in view of their silence recently, there's no need for continued punishment... given my app, d.p. should be able to see how to do preprocessing. -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080710/a6ed5fcb/attachment.htm From vlsimpson at gmail.com Thu Jul 10 22:58:21 2008 From: vlsimpson at gmail.com (V. L. Simpson) Date: Fri, 11 Jul 2008 00:58:21 -0500 Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 006 In-Reply-To: References: Message-ID: On Tue, Jul 8, 2008 at 2:15 AM, wrote: > it's very good to do an early search for garbage characters... > this is a search that benefits from using regular expressions. > the reg-ex is something along these lines: [\&\*\<\>\\\/\|\*\{\}\_] Why all the backslashes in a character class? And why "*" character twice? From Bowerbird at aol.com Thu Jul 10 23:45:21 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Jul 2008 02:45:21 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 006 Message-ID: v.l. said: > Why all the backslashes in a character class? because i don't speak reg-ex fluently? because i tried it and it seemed to work, so heck, that was good enough for me? because i know someone will correct me when i am wrong, or even just somewhat inefficient? because real programmers code our own find routines, due to reg-ex being messy-complexy _and_ poke-slow? > And why "*" character twice? because i want to make sure it gets them _all_... :+) -bowerbird p.s. but seriously, thank you for your input on this; indeed, if you would be so kind as to turn all of my plain-english descriptions into reg-ex, it'd be great for those people out there who _do_ rely on reg-ex. ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080711/a3d7bb3d/attachment.htm From Bowerbird at aol.com Fri Jul 11 00:42:08 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Jul 2008 03:42:08 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- roadmap2 Message-ID: ok, the two parallel proofings of "mountain blood" have concluded. there are 129 differences between the two proofings, shown here: > http://z-m-l.com/go/mount/129_differences_a-vs-b_total.html considering there were 9000+ lines in this book, that's pretty good... more importantly, from my perspective, the preprocessing i did seems to have left _just_3_errors_ in this book of over 360 pages, which these human p1 proofers detected... here they are: > If they took away the chair, Gordon knew, he wag should be: > If they took away the chair, Gordon knew, he was > "Why, damn it fell, Gord!" exclaimed an individual, should be: > "Why, damn it t'ell, Gord!" exclaimed an individual, > grip of these blood-money men; we'll have a state > la wed bank; a rate of interest a man can carry without should be: > grip of these blood-money men; we'll have a state > lawed bank; a rate of interest a man can carry without i'll have more to say about this tomorrow; this is enough for now... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080711/32defd66/attachment.htm From Bowerbird at aol.com Fri Jul 11 09:43:13 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Jul 2008 12:43:13 EDT Subject: [gutvol-d] data on the mountain experiment Message-ID: here's some more data on the differences between the two parallel proofings on the "mountain blood" experiment that roger frank ran over on the d.p. site. before i just showed you the differences: > http://z-m-l.com/go/mount/129_differences_a-vs-b_total.html appended, i show which of the two parallel-proofings got which lines wrong... -bowerbird page:line --> these are the 59 lines that mountain-a got wrong 0003:0011 ALFRED.A.KNOPF 0004:0003 ALFRED A. KNOPF, inc. 0020:0015 It was well known that the first George Gordon Mac-Kimmon--the 0028:0002 'Most people go a good length before fighting with 0041:0001 -*ing or benevolent sentences; these, with appropriate 0091:0020 the palpitating day. One of Gordon's nephews-a 0097:0003 you for that amount,'the skinflint says, and sells 0098:0001 *nuto passage of a symphony; "but it's all one to me 0098:0002 --there's nothing else they can take; I'm free, free 0100:0003 CLARE'S funeral deducted a further sum 0100:0011 or for pleasure. It was the hottest hour of the 0100:0015 been cropping the grass in. the broad, shallow gutter 0116:0001 *-ers, from the farm. As he approached he saw that 0119:0001 *-flected in the warmer tones of his replies; a new 0121:0031 spice. Still his grasp tightened upon her.hand, drew 0124:0001 *-luctant eagerness. He kissed her again and again, 0129:0032 elements, to the bitter mountain winters, the ruthless 0130:0001 suns of the August valleys. He was as seasoned, 0130:0024 Opposite Gordon Malummon sat a slight, feminine 0135:0026 -smart crowds and gay streets and shops on fire 0142:0001 *-ing that it must be a messenger from the village, dispatched 0159:0006 were all stirring him up a little; you didn't say any-thing--" 0160:0004 IT was his own home to which he returned, the 0161:0002 Lattice, in white, with a dark shawl drawn about 0166:0007 an effort to keep his impatience from his voice, "I 0174:0001 *-ing General Jackson at his heels, he picked the dog 0178:0004 THE spring night was potent, warm and 0182:0004 THE memory of Meta Beggs was woven like a 0182:0012 He wished to repay her for that injury to his selfesteem. 0184:0004 HE drove over the road that lay at the base 0191:0004 META BEGGS saw Gordon at the same 0196:0017 "A 'little stroll.' " Buckley produced a heavy 0197:0030 It was seen immediately that the skull was broken-a 0205:0004 ON Sunday he strolled soon after breakfast 0214:0001 *-atory position. He would extract the last penny of 0219:0014 hundred per cent, increase." 0227:0001 *-nolia flowers, would never thicken and grow rough. 0234:0003 RUTHERFORD BERRY and Effie, Barnwell 0238:0001 accomplished fact; Lattice's wishes, her quality of 0249:0004 "I'VE got something for you," Gordon said suddenly. 0249:0030 "I've been thinking of you in-those pretty clothes," 0252:0004 BUT, curiously, sitting alone, he gave little 0253:0013 a little from his blood. She demanded a great deal---a 0255:0004 was insanity. Simeon Caley's wife should 0256:0003 GORDON MAKIMMON made one step toward 0256:0004 her. Lattice held the box in an extended 0264:0003 A HOARSE, thin cry sounded from within 0289:0004 TWENTY-SEVEN hundred and ninety 0291:0013 everywhere; Gordon had pitched the headstal into 0301:0013 Alexander 'll take your horse. He's only at the back 0314:0003 THE year, in the immemorial, minute shifting 0319:0003 GORDON MAKIMMON, absorbed in the 0326:0003 EVEN if he proved able to buy out Simmons, 0333:0002 Mrs. Hollidew in. the sitting room. He would wake 0347:0007 "The two hundred dollar dog!The joke on 0349:0003 THE cold sharpened; the sky, toward evening, 0351:0025 -you can take it or leave it--if you'd drive again? 0356:0004 BUCKLEY SIMMONS was late in arriving 0361:0004 GORDON MAKIMMON rose to a sitting page:line --> these are the 24 lines that mountain-b got wrong 0037:0007 stood the Makimmon dwelling. Originally a foursquare, 0098:0009 They stood before the dark, porchless fagade 0100:0025 in that banal setting, suddenly grew unbearable.... 0100:0026 There was no life in Greenstream.... 0122:0009 But the things I want to hear may not 0123:0003 my heart, something has gone, and 0125:0014 medicine. Wait here for me, I will come 0125:0016 in you. Love makes everything 0142:0032 away, leaving her pale. Her lips trembled, A palpable, 0151:0011 in silkaleen and back in Al mohair, it'll stand you 0155:0035 the options, bring you the result in a couple of weeks.] 0213:0029 the astute storekeeper into such a satisfactory, retail-* 0232:0019 unintelligible period about French widows and pink.... 0232:0020 "Buried before my time," he proclaimed. He 0275:0024 denned her breasts and a hip as crisply as though 0327:0027 He might get them all together, explain, persuade.... 0327:0028 Goddy! it was for their good. They needn't 0331:0025 but not Kenny's for nineteen years." Another bore, 0341:0007 the prospect of release from, its bewildering fullness. 0341:0010 in. the return of the options to a county enhanced 0346:0005 "all them that Pompey had and you bought?" 0349:0012 A thread of light appeared against the fagade of 0363:0019 "Cm on," he called impatiently; "you'll take no 0368:0002 him to where, on. the bureau, a lamp had been left. ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080711/3132ba1d/attachment-0001.htm From Bowerbird at aol.com Fri Jul 11 22:29:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 12 Jul 2008 01:29:42 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 009 Message-ID: 9. search for all lines that start with a space. > ... the strain of lawlessness brought so many years > ... rascal," Gordon heard him mutter, "spendthrift. > ... There was no life in Greenstream.... > ... the medicine. Wait here for me, I will come > ... trust in you. Love makes everything > ... Pompey left one of the solidest estates in this > ... What would you say to a flat eight dollars an > ... was a lucky man, > ... never saw the women > ... Won't you come up and smoke a cigarette? > ... it's rising," he proclaimed, in a loud, singsong > ... now." > ... nobody saw." > ... by the South Fork entrance ... through > ... that is all the Stenton doctor will say; a piece > ... "Buried before my time," he proclaimed. He > ... waiting ... I couldn't wait any longer, Gordon, > ... quick as you can ... the doctor." > ... this time. Tell your husband he can pay me > ... I've got a lot of money laid out. What's been > ... it's the blood. I've studied considerable about > ... never again! I want--" > ... Goddy! it was for their good. They needn't > ... others ... new courage, example of bigness 24 of them. on all of them, the ellipse was at the start of the line, so the space simply needed to be deleted. easy enough. in addition, however, 4 lines had dropped an opening quote: > "... rascal," Gordon heard him mutter, "spendthrift. > "... it's rising," he proclaimed, in a loud, singsong > "... by the South Fork entrance ... through > "... quick as you can ... the doctor." also, 2 more lines were in a poem, so needed to be indented, along with another 2 lines that accompanied them. > ... was a lucky man, > Rip van Winkle ... grummmble > ... never saw the women > At Coney Island swimming ... 32 more lines corrected, for a grand total of 109, on 9 routines... i'll be back tomorrow with the next tip in this series... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080712/d9122497/attachment.htm From Bowerbird at aol.com Sat Jul 12 13:54:43 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 12 Jul 2008 16:54:43 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 010 Message-ID: 10. search for all one-character lines. 8 lines present, 3 of which were correct. > X (correct, for chapter x) > O (deleted as incorrect) > X (correct, for chapter x) > V (deleted as incorrect) > \ (deleted as incorrect) > X (correct, for chapter x) > T (deleted as incorrect) > : (moved up to end of previous line) surrounding blank lines were also closed up, where appropriate. 5 more lines corrected, for a grand total of 114, on 10 routines... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080712/42bb67f4/attachment.htm From Bowerbird at aol.com Sun Jul 13 23:27:37 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 14 Jul 2008 02:27:37 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 011 Message-ID: 11. search for all lines with a period-comma, or a comma-period. period-comma: > Gordon's lips formed a silent exclamation.,. > to throw it away--the vultures, Hollidew and Co., > girl until--until Buckley.,. until to-night, now. > Barnwell K., through an oversight, was defrauded > Barnwell K., valiantly endeavoring to emulate his > to its goal,., Gordon saw now that Mrs. Caley > your wife. Miss Beggs oughtn't.,. she isn't anything comma-period: > Gordon's lips formed a silent exclamation.,. > girl until--until Buckley.,. until to-night, now. > to its goal,., Gordon saw now that Mrs. Caley > your wife. Miss Beggs oughtn't.,. she isn't anything 7 lines presented with a period-comma. of those 7, 4 of 'em also presented as containing a comma-period. it is the case that those 4 lines were the incorrect ones, where the misrecognition should have been an ellipse. 4 lines corrected, for a grand total of 118, on 11 routines... i'll be back tomorrow with the next tip in this series... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080714/320f6428/attachment.htm From Bowerbird at aol.com Mon Jul 14 09:55:40 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 14 Jul 2008 12:55:40 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 012 Message-ID: 12. search for all lines with a period at the start of the line... > . I wanted to see you; ah, yes." He 41 line-beginning-ellipse > . in the beginning they had let their wide share 44 line-beginning-ellipse > . or for pleasure. It was the hottest hour of the 100 line-beginning-ellipse > . it seemed so useless. You were like a ... a 122 line-beginning-ellipse > . in my heart, something has gone, and 123 line-beginning-ellipse > .stones, wedding bands, gold pins and 238 nothing(speck) > . it was insanity. Simeon Caley's wife should 255 line-beginning-ellipse of the 7 lines presenting, 6 were cases where the line-starting period was actually a line-starting ellipses, and they were changed accordingly. in the 7th (.stones), the line-starting period was a speck, so was deleted... 7 more lines corrected, for a grand total of 125, on 12 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080714/4e11dcc3/attachment.htm From Gutenberg9443 at aol.com Mon Jul 14 12:16:26 2008 From: Gutenberg9443 at aol.com (Gutenberg9443 at aol.com) Date: Mon, 14 Jul 2008 15:16:26 EDT Subject: [gutvol-d] continued confusion over at distributed proofreaders Message-ID: In a message dated 7/1/2008 9:30:32 P.M. Mountain Daylight Time, gbuchana at teksavvy.com writes: would like to see a system like DP actually _introduce_ a specific known error or two into each page and not accept the page until the proofers had found and corrected it. I want the system to be able to verify that a known level of dilligence is being taken. This is an extremely good idea. When I was a police officer I was taught to put a deliberate typo on every page of a statement or confession and then have the person making the statement correct and initial the typo. That way I could demonstrate to a jury that the person did have the chance to read and if necessary correct errors--and since people never tell the same story twice exactly the same way, unless they're bards, often real errors were caught and corrected while doing this. Anne **************Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080714/c8fc6e25/attachment.htm From jeroen.mailinglist at bohol.ph Mon Jul 14 14:47:46 2008 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Mon, 14 Jul 2008 23:47:46 +0200 Subject: [gutvol-d] continued confusion over at distributed proofreaders In-Reply-To: References: Message-ID: <487BC982.5070003@bohol.ph> A good way to get controversial or bad reports approved is to introduce some obvious problems (typo's, etc.) in some places. This will then distract the attention from the real issues, and move the thing through bureaucracy, with people happy they've been able to make some comments and improve it.... It can work both ways..... Jeroen. Gutenberg9443 at aol.com wrote: > > In a message dated 7/1/2008 9:30:32 P.M. Mountain Daylight Time, > gbuchana at teksavvy.com writes: > > would like to see a system like DP actually _introduce_ a > specific known error or two into each page and not accept > the page until the proofers had found and corrected it. I > want the system to be able to verify that a known level of > dilligence is being taken. > > > > This is an extremely good idea. When I was a police officer > I was taught to put a deliberate typo on every page of a > statement or confession and then have the person making > the statement correct and initial the typo. That way I could > demonstrate to a jury that the person did have the chance > to read and if necessary correct errors--and since people > never tell the same story twice exactly the same way, > unless they're bards, often real errors were caught and > corrected while doing this. > > Anne > > > > **************Get the scoop on last night's hottest shows and the live music > scene in your area - Check out TourTracker.com! > (http://www.tourtracker.com?NCID=aolmus00050000000112) > > > ------------------------------------------------------------------------ > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Mon Jul 14 15:26:50 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 14 Jul 2008 18:26:50 EDT Subject: [gutvol-d] continued confusion over at distributed proofreaders Message-ID: as i said in response to this originally, this "solution" isn't needed, because proofers _are_ paying good attention, as evidenced by their _high_accuracy_. moreover, with good "preprocessing", which can take the error-rate down to next-to-nothing before proofers get it, a one-round proofing will be sufficient. actually, it's more like a smooth-reading. as example, consider "blood mountain". there were 3 errors after preprocessing. both parallel proofings found all of 'em. you're being distracted by a concern over a nonexistent "problem". open your eyes. -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080714/f993ef48/attachment.htm From schultzk at uni-trier.de Tue Jul 15 01:48:11 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Tue, 15 Jul 2008 10:48:11 +0200 Subject: [gutvol-d] continued confusion over at distributed proofreaders In-Reply-To: References: Message-ID: <4F77EA90-819A-4F32-A995-91F3278EC13E@uni-trier.de> Hi Everbody, I agree with Bowerbird. There is no need for such "quaility control" I do admit that some of the arguments mentioned have thier place in the situations mentioned, yet they are not applicable to DP. DP text are proofed by, in general, by a couple of proofers. Therefore it is redundant and quality control is overt. Futhermore the proofers are not stressed to finish up nor are they not allowed to go back and check again in case they are unsure! regards Keith. Am 15.07.2008 um 00:26 schrieb Bowerbird at aol.com: > as i said in response to this originally, > this "solution" isn't needed, because > proofers _are_ paying good attention, > as evidenced by their _high_accuracy_. > > moreover, with good "preprocessing", > which can take the error-rate down to > next-to-nothing before proofers get it, > a one-round proofing will be sufficient. > actually, it's more like a smooth-reading. > > as example, consider "blood mountain". > > there were 3 errors after preprocessing. > both parallel proofings found all of 'em. > > you're being distracted by a concern over > a nonexistent "problem". open your eyes. > > -bowerbird > > > > ************** > Get the scoop on last night's hottest shows and the live music > scene in your area - Check out TourTracker.com! > (http://www.tourtracker.com?NCID=aolmus00050000000112) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080715/ce65d24a/attachment.htm From Bowerbird at aol.com Tue Jul 15 02:37:02 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Jul 2008 05:37:02 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 014 Message-ID: capital-i, followed by a number of low-probability letters... 14. I[abcdeghijklopquvwxyz] > face, with its heavy, good features and slow-Idndling > Ill > Ill > "I do! Idol" He turned and left them, striding > Ill 5 lines presented, with each of the 5 containing an error... 5 more lines corrected, for a grand total of 130, on 14 routines... i'll be back tomorrow with the next tip in this series... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080715/4e20692a/attachment.htm From Bowerbird at aol.com Tue Jul 15 09:33:50 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Jul 2008 12:33:50 EDT Subject: [gutvol-d] the proofreaders are not the problem at distributed proofreaders Message-ID: the proofreaders are _not_ the problem at distributed proofreaders, no sir... the problem is the awful workflow to which the proofers are being subjected. _all_ of the data i have analyzed from the various d.p. experiments -- and it has been a lot of data, i know, i don't blame you if you haven't followed it all -- has made it abundantly clear that the proofers are doing a _great_ job... if i were to grade their performance, i would give them a good solid "a"... they don't get it all right the first time, but they rarely introduce any errors. the d.p. _administrators_, though, have a significantly worse track-record. in 2003, i would have given them a "b", based on the big implicit potential. by 2004, their grade had dropped to a "b-". by 2005, a "c+". 2006, a "c". 2007 would have netted them a "c-". and now in 2008, it's clearly a "d"... consider the "how-to-preprocess" series that i've been running recently... i've already listed over a dozen simple, predictable routines to find errors. all of 'em should be immediately obvious to any person familiar with o.c.r. every one has found errors in the text against which they're being tested, and returned very few false-alarms. so one simple question poses itself: why were _none_ of these routines used in the preprocessing of this text? seriously, haven't the administrators at d.p. learned _anything_ about finding and fixing errors in o.c.r.? they've digitized literally _thousands_ of books, yet they don't have the most primitive of routines in place yet... they should be _extremely_embarrassed_ by their miserable performance. instead of using the computer to find and fix glitches, they leave it to their human volunteers. this is a waste of the resources being donated to them. indeed, more than being embarrassed, they should be ashamed. -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080715/d47d43f3/attachment.htm From Bowerbird at aol.com Tue Jul 15 10:47:08 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Jul 2008 13:47:08 EDT Subject: [gutvol-d] what d.p. and rfrank need to do -- a 10-point plan Message-ID: here's what roger frank needs to do to get his preprocessing program going... 1. clean up the paragraphs. (have to do it sooner or later, so do it sooner.) 2. put the top-blank-line on appropriate pages. (so proofers don't have to.) 3. clear up the spacey quotes. (literally _hundreds_and_hundreds_ of these.) 4. standardize ellipses. (so proofers skip the merry-go-round of changes.) 5. standardize em-dashes. (here too, skip the changes merry-go-round.) 6. dehyphenate. (or, better yet, delay that step until _after_ the proofing.) 7. "clothe" hyphens. (or, better yet, just stop doing that stupid d.p. policy.) 8. run the routines that find the obvious o.c.r. errors (as i've demonstrated.) 9. do a much better job of formulating the "good words" list. (saves time!) 10. congratulate yourselves for a job well done... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080715/f6316429/attachment-0001.htm From Bowerbird at aol.com Wed Jul 16 01:38:15 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Jul 2008 04:38:15 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 015 Message-ID: 15. search for all lines with a period-whitespace followed by lowercase, except controlling for cases where it was ellipse-whitespace-lowercase... 6 of the presenting cases were specks misrecognized as a period, which, of course, is one of the main targets of this search routine: > been cropping the grass in. the broad, shallow gutter > himself pointedly in. its defiance. > Mrs. Hollidew in. the sitting room. He would wake > in. the return of the options to a county enhanced > quickly away; the. house was without a > him to where, on. the bureau, a lamp had been left. 5 of the presenting cases were ones where the period was really an ellipse: > . in the beginning they had let their wide share > . or for pleasure. It was the hottest hour of the > . it seemed so useless. You were like a -- a > . in my heart, something has gone, and > . it was insanity. Simeon Caley's wife should 2 of the presenting cases were other instances of a misrecognized ellipse: > girl until--until Buckley.,. until to-night, now. > your wife. Miss Beggs oughtn't.,. she isn't anything the remaining 3 of the presenting cases were _correct_, as they involved a last-name represented as a single letter: > I had promised to bring Barnwell K. the next time." > red cloth; on one side Barnwell K. sat flanked by > K. and the delicate Rose, left after 13 more lines corrected, for a grand total of 143, on 15 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080716/c1129ed5/attachment.htm From russellbell at gmail.com Wed Jul 16 17:46:53 2008 From: russellbell at gmail.com (Russell Bell) Date: Wed, 16 Jul 2008 18:46:53 -0600 Subject: [gutvol-d] Maxims for Revolutionists from Shaw's 'Man and Superman' Message-ID: <688269960807161746ib821bd2n1febc78df04d96ca@mail.gmail.com> 'Maxims for Revolutionists' is an appendix to Shaw's 'Man and Superman' but it is not included in Gutenberg's edition thereof. Why? From gbnewby at pglaf.org Wed Jul 16 19:03:03 2008 From: gbnewby at pglaf.org (Greg Newby) Date: Wed, 16 Jul 2008 19:03:03 -0700 Subject: [gutvol-d] Maxims for Revolutionists from Shaw's 'Man and Superman' In-Reply-To: <688269960807161746ib821bd2n1febc78df04d96ca@mail.gmail.com> References: <688269960807161746ib821bd2n1febc78df04d96ca@mail.gmail.com> Message-ID: <20080717020302.GA15135@mail.pglaf.org> On Wed, Jul 16, 2008 at 06:46:53PM -0600, Russell Bell wrote: > 'Maxims for Revolutionists' is an appendix to Shaw's 'Man and Superman' > but it is not included in Gutenberg's edition thereof. Why? I don't really know. But most likely this was because whoever digitized the text didn't provide the appendix. -- Greg From hyphen at hyphenologist.co.uk Wed Jul 16 23:50:46 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Thu, 17 Jul 2008 07:50:46 +0100 Subject: [gutvol-d] Maxims for Revolutionists from Shaw's 'Man and Superman' In-Reply-To: <20080717020302.GA15135@mail.pglaf.org> References: <688269960807161746ib821bd2n1febc78df04d96ca@mail.gmail.com> <20080717020302.GA15135@mail.pglaf.org> Message-ID: <000901c8e7d9$7d4f2f00$77ed8d00$@co.uk> Greg Newby wrote >On Wed, Jul 16, 2008 at 06:46:53PM -0600, Russell Bell wrote: >> 'Maxims for Revolutionists' is an appendix to Shaw's 'Man and Superman' >> but it is not included in Gutenberg's edition thereof. Why? >I don't really know. But most likely this was because whoever >digitized the text didn't provide the appendix. -- Greg Is Russell volunteering to do it? Somebody should! Dave Fawthrop. From jayvdb at gmail.com Thu Jul 17 00:05:11 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Thu, 17 Jul 2008 17:05:11 +1000 Subject: [gutvol-d] Maxims for Revolutionists from Shaw's 'Man and Superman' In-Reply-To: <000901c8e7d9$7d4f2f00$77ed8d00$@co.uk> References: <688269960807161746ib821bd2n1febc78df04d96ca@mail.gmail.com> <20080717020302.GA15135@mail.pglaf.org> <000901c8e7d9$7d4f2f00$77ed8d00$@co.uk> Message-ID: On Thu, Jul 17, 2008 at 4:50 PM, Dave Fawthrop wrote: > > Greg Newby wrote > >>On Wed, Jul 16, 2008 at 06:46:53PM -0600, Russell Bell wrote: >>> 'Maxims for Revolutionists' is an appendix to Shaw's 'Man and > Superman' >>> but it is not included in Gutenberg's edition thereof. Why? > >>I don't really know. But most likely this was because whoever >>digitized the text didn't provide the appendix. > -- Greg > > Is Russell volunteering to do it? Somebody should! Wikisource has the complete text; we used the PG text and bartleby for the appendixes: http://www.bartleby.com/157/index.html -- John From Bowerbird at aol.com Thu Jul 17 01:13:13 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 17 Jul 2008 04:13:13 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 016 Message-ID: ok, let's spend a few days talking about how to clean up the paragraphing. basically, this means that you've got an empty line between all paragraphs, and no empty lines within any paragraphs. the o.c.r. gets most of this right, most of the time, and the exceptions are fairly easy to detect automatically... once you've got the paragraphing correct, you can then have the machine go ahead and fix almost all the spacey quotes automatically and correctly... that's a good thing. (in our current test book, there were no spacey quotes, but in some of the other test books, there are many, sometimes over 1000.) our first test is lines after a blank line which start with a lower-case character. 16. double-line-end (here signified by "//") followed by lowercase > 'Most people go a good length before fighting with//me." > see in my ramshackle house and used up ground, is//over me." > _Mr. Ottinger elected to imbibe his "straight"//from the bottle--it was drunk with > mutual assurances > slowly and rolled like a flash over her plastered//skin. > cern now was to get away, to take the money with//him. > sake," Otty gasped, "get to him, the town'll be on//us." > to the door; it said, "Gone fishing. Back to-//morrow." > interior which absorbed them.//fl04] > the other men would hate him; they would all want//me." > that would go twice about the neck and then hang//some." > \//he wouldn't have gone, anyway. > fell sooner and night lingered late into//morning. all of these 12 cases were ones where a paragraph was incorrectly split, so they were rejoined. 12 more lines corrected, for a grand total of 155, on 16 routines... be back tomorrow with the next tip in this series... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080717/87d477d1/attachment.htm From hart at pglaf.org Thu Jul 17 09:44:01 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 17 Jul 2008 09:44:01 -0700 (PDT) Subject: [gutvol-d] Maxims for Revolutionists from Shaw's 'Man and Superman' In-Reply-To: <20080717020302.GA15135@mail.pglaf.org> References: <688269960807161746ib821bd2n1febc78df04d96ca@mail.gmail.com> <20080717020302.GA15135@mail.pglaf.org> Message-ID: Actaully, this is a GREAT appendix, and if anyone is willing to work on it, I will help walk it through. . . . Thanks!!! Michael On Wed, 16 Jul 2008, Greg Newby wrote: > On Wed, Jul 16, 2008 at 06:46:53PM -0600, Russell Bell wrote: >> 'Maxims for Revolutionists' is an appendix to Shaw's 'Man and Superman' >> but it is not included in Gutenberg's edition thereof. Why? > > I don't really know. But most likely this was because whoever > digitized the text didn't provide the appendix. > -- Greg > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From russellbell at gmail.com Thu Jul 17 18:26:12 2008 From: russellbell at gmail.com (Russell Bell) Date: Thu, 17 Jul 2008 19:26:12 -0600 Subject: [gutvol-d] Maxims for Revolutionists from Shaw's 'Man and Superman' Message-ID: <688269960807171826s1c133dfx41cc90288a81ae88@mail.gmail.com> I just downloaded it and the other appendix from Bartleby's. I'll look up Gutenberg's rules for submission. Should I make them separate items or add them to 'Man and Superman'? From Bowerbird at aol.com Fri Jul 18 02:51:52 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 18 Jul 2008 05:51:52 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 017 Message-ID: as our next step in ensuring the integrity of our paragraphing, we will search for paragraphs that were not terminated with something we consider "reasonable", which would be a period, exclamation-point, or question-mark, or any of those three things followed by a single-quote-mark and/or a double-quote-mark... 17. paragraphs terminated incorrectly you'll see that we get a lot of hits here, so many that i've appended them, after sorting them into some basic categories, which i will discuss here... so, we churned up lots of stuff... first, titles and forward matter spring up, but we weed those out quickly, since they aren't really "paragraphs" at all... next, there are a lot of lines that "end" with a colon. they're very plentiful. the colon signifies that "a block of some type follows this". sometimes it's just a brief statement from a person, mere dialog. but other times it can be quite an extensive block, such as a letter or a sign or a telegram or whatever. so the colon is legitimate here, but it will also be useful to us later, when we do _formatting_ -- because that "block that follows" will need to be treated -- so it's a good thing we discovered this quick method of finding those blocks. the next category is the _em-dash_, which is also a legitimate termination, so we'll add those checks to the routine in the future. this is how you learn. we also get one _en-dash_, which would be an invalid termination, except it's actually an o.c.r. error, so we will just fix it, thereby removing the flag... we also get -- and fix -- a misrecognition of exclamation-point as capital-i. and a misrecognition of a period as a comma, so we've now fixed three lines. we've also located a poem, consisting of 4 lines, currently split apart but needing to be rejoined, so we eliminate the 3 blank lines separating them. we found one paragraph that ended in garbage, so we corrected that glitch. we also had a broken paragraph that we fixed, so we're up to 8 lines now... we drop 4 one-character lines, so we're at 12. we eliminate some drop-cap garbage, and an excess runhead, so 14... all in all, an interesting hodge-podge of lines popping up from that search, and 14 lines corrected, so well worth the effort of sorting through the stuff... 14 more lines corrected, for a grand total of 169, on 17 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird 17. paragraphs terminated incorrectly *** colon (so a legitimate termination) and familiar figures: past its banks. Then: solemn eyes: spoke in a strain of querulous sweetness: audible in broken phrases: She said promptly: toward the lower level. Then: distance away: formed words: composure with a struggle: personality. He heard Simmons say: she drew him to her: hand: words: formed it soundlessly, he even spoke it aloud: only by Gordon's breathing: Then he went to the door: exactly the same manner: She turned her face from him. He said: be a fortune." Silence fell upon them. Then: : He grew silent, enveloped in thought. Then: the sitting room, where he stood lost in thought: dangerous murmur rose: said: him: *** em-dash (so a legitimate termination) she had kissed him for a pair of silk stockings-- his desire, his-- printed a deliberate--a deliberate-- *** misrecognition of an em-dash (which is legitimate) as an en-dash (which is not) of a thing to go and do! ... off horse ... willing-" *** misrecognition of exclamation-point as a capital-i I'm no sheep to drive into their lot and shear I" *** misrecognition of a period as a comma He heard a murmur from the back of the throng, ~~ "Give it to him, we didn't come here to talk.@ *** poem (which should be joined into a single block) ... was a lucky man, ~~ Rip van Winkle ... grummmble Rip van Winkle ... grummmble ~~ ... never saw the women ... never saw the women ~~ At Coney Island *** garbage the trap. _{t} ~~ The bitter irony of it rose in a wave of black mirth *** broken paragraph (i.e., garbage) served in two glasses and a cracked toothbrush mug ~~ _Mr. Ottinger elected to imbibe his "straight" *** single-character lines (i.e., garbage) O ~~ "I got it," he interrupted her tersely, "and I V ~~ BUT, curiously, sitting alone, he gave little \ ~~ he wouldn't have gone, anyway.@ T ~~ XI *** drop-cap garbage ' 'TT'VE got something for you," Gordon said sud- ~~ I denly.@ *** runhead (i.e., garbage) MOUNTAIN BLOOD ~~ "I don't choose to be," Meta Beggs retorted. "I ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080718/0f388ae6/attachment.htm From Bowerbird at aol.com Fri Jul 18 13:48:29 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 18 Jul 2008 16:48:29 EDT Subject: [gutvol-d] lots of good stuff next week Message-ID: there'll be lots of good stuff next week... first, "the crevice" -- another rfrank experiment at d.p. -- has made its way rapidly through both p1 and p2 now... i actually re-did the o.c.r. on this book -- as part of my ongoing series on "how to digitize a book, step by step", so i'll be continuing that series using this live example... also, "the cabin on the prairie" -- another rfrank test -- has finished p1, so i'll be able to make comments on it... neither of these books got good preprocessing on them -- which is why i re-did the o.c.r. -- so i won't examine the (hundreds and hundreds) of _unnecessary_changes_ that the proofers had to do (e.g., rejoining hyphenates), because if that b.s. doesn't already _stink_badly_ to you, your nose isn't working correctly. what i _will_ do is show -- like i did on "mountain blood" -- that if you do the preprocessing correctly, you transform the "proofing" job into something where the proofers can concentrate on the job of _perfecting_the_book_ instead of just removing the obvious crap on all the individual pages and leaving the "perfecting" task to the next person in line... i'll also continue my "how to clean up the o.c.r." series, so you know _exactly_ how to _do_ that good preprocessing. lots of sleeves-up fun here in the lobby of the p.g. library... have a good weekend... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080718/fb216785/attachment.htm From russellbell at gmail.com Fri Jul 18 14:52:18 2008 From: russellbell at gmail.com (Russell Bell) Date: Fri, 18 Jul 2008 15:52:18 -0600 Subject: [gutvol-d] submitting a text Message-ID: <688269960807181452q78a8cf5al54a2372d61ae05a8@mail.gmail.com> I downloaded Bartleby's copies of 'Maxims for Revolutionists' and 'Revolutionists' Handbook and Pocket Guide', formatted them in accord with the rules, gutchecked them, made iso8859 & ASCII copies. Downloaded a copy of the image of an original edition from googlebooks for comparison. Now where do I send them? The FAQ tells me to e-mail them to any member of the posting team but gives no addresses for any of them. From ajhaines at shaw.ca Fri Jul 18 15:53:01 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Fri, 18 Jul 2008 15:53:01 -0700 Subject: [gutvol-d] submitting a text References: <688269960807181452q78a8cf5al54a2372d61ae05a8@mail.gmail.com> Message-ID: <000801c8e929$08661d60$6401a8c0@ahainesp2400> Russell, before you can submit your files to PG, you'll have to get copyright clearances for both books. You can do that at http://upload.pglaf.org/. - click the "New username" link to set yourself up with an account. - click the "Login" link to log in with your new username and password. - click the "welcome" link at the top of the page. - in the News section, click on "new copyright clearance system" - for each book, click on "submit a new clearance request", and fill in the form. You'll have to have scans of each book's title and copyright page, the latter usually being on the back (verso) of the title page. - click the "logout" link at the top of the form when you're finished. It may take several days, but you'll get back an e-mail for each book saying whether the clearance is "OK" or "Not OK". (The reasons for a "Not OK" are beyond the scope of this message.) Zip all the files (ASCII, Latin1, HTML, etc) for a given book into a single file for uploading. Log into the upload page above, and click the "Get status of my prior clearance requests" link. Click the Cleared link to see your clearances, then on whichever book you want to upload to the Whitewashers. Click on the book's Clearance OK link, and fill in the upload form. In step 2 of the form, select ASCII, Latin1 (ISO8859), or whatever's appropriate for that book. (Don't select Other.) In steps 3 and 4, check that the info is correct. Fill in Steps 5 and 6 as you see fit. At step 7, if the submission includes an HTML file (with or without illustrations), it's recommended that you do a Preview submission to check the HTML's validity. If it's OK, fill in step 1 again, then click the Submit eBook button. The Whitewashers (one of whom is me) will get an email notifying us that a new submission has come in. Depending on the volume of new submissions, it may take us a day or two to handle yours. FYI - you don't necessarily have to generate your own ASCII files. The Whitewashers routinely generate ASCII files from ISO8859 files as part of the posting process. It's only when that conversion proves difficult, for whatever reason, that we ask the submitter to prepare and submit their own ASCII file, along with the Latin1 file. Al ----- Original Message ----- From: "Russell Bell" To: Sent: Friday, July 18, 2008 2:52 PM Subject: [gutvol-d] submitting a text >I downloaded Bartleby's copies of 'Maxims for Revolutionists' and > 'Revolutionists' Handbook and Pocket Guide', > formatted them in accord with the rules, gutchecked them, made iso8859 > & ASCII copies. Downloaded a > copy of the image of an original edition from googlebooks for > comparison. Now where do I send them? The > FAQ tells me to e-mail them to any member of the posting team but > gives no addresses for any of them. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Sat Jul 19 11:20:11 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 19 Jul 2008 14:20:11 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 018 Message-ID: saturday, so let's take a big bite for the weekend... the paragraphs should be all set now... a quick visual scan through the entire book will inform you of any remaining errors in the flow... we will be slightly modifying the level of description we've been using. so far, we've talked about _lines_. now, we'll talk about _paragraphs_. we _define_ a paragraph as "any line or lines that occur _between_ empty lines", so there has been a _correspondence_ of the two, and you can still program the paragraph routines _in_terms_of_ "lines", but it's easier to say what needs to be said if we call 'em "paragraphs". today, we're not really gonna bring up "hits" and evaluate what to do. we're just gonna describe in plain english words what a tool will do... specifically, this tool will "fix" the spacey-quotes in our o.c.r. output. and this is how it does it. first, it examines paragraph-by-paragraph. within each paragraph, it counts double-quote-marks. (single later.) it goes on to evaluate each quotemark, to determine whether it is: 1. a spacey-quotemark. (one that has whitespace on both sides.) 2. an open-quotemark. (whitespace to the left, letters to the right.) 3. a close-quotemark. (letters to the left, whitespace to the right.) (things get a little more complicated when you have nested quotes, and markup characters, but the basics work well most of the time.) then lastly, the routine evaluates whether all _odd-numbered_ quotes are open-quotes and all _even-numbered_ quotes are close-quotes; if so, then it assigns the spacey-quotemarks their respective status... if not, then it summons these questionable quotes to your attention. in addition, when there is an odd number of quotes in the paragraph, it checks to make sure the next paragraph starts with a quote-mark, and -- if not -- summons that paragraph to your attention as well... because of the redundancy built into this routine, it is _very_ robust. it will basically _find_ more mistakes in the text than it will _cost_ you in erroneous assignments. and it can auto-fix _hundreds_ of errors, -- hundreds and hundreds and hundreds of errors -- in no time flat. assuming, that is, that there _are_ some spacey-quotes in your o.c.r. "mountain blood", however, had no spacey-quotes. so none were fixed. 18. fix the spacey quotes. 0 more lines corrected, for a grand total of 169, on 18 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080719/c1c2cab7/attachment.htm From Bowerbird at aol.com Sun Jul 20 21:52:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Jul 2008 00:52:03 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 019 Message-ID: today we talk about paragraph-breaks that occur on page-breaks... the preprocessing done on "mountain blood" followed the rule that any page that started with a capital letter or a double-quotemark got a line inserted above it, the assumption being it was a new paragraph. that's a pretty good rule, but it's not the best one you can use, since it creates a number of unnecessary false-alarms. it's far better to check the last line of the _previous_ page to see if it ends with a paragraph-termination. if it does, and the current page starts with a capital letter or a double-quotemark, add the line then. if the previous page is paragraph-terminated, but the current one doesn't look like the start of a paragraph, then summon a human... of course, it's entirely possible for a _sentence_ to end on one page, with the next sentence then beginning on the top of the next page, with both sentences within the same paragraph. you can do a check for these exceptions by checking the _length_ of the terminating line; if it's a long line, then there's a good chance that it is mid-paragraph. however, there are exceptions to these exceptions as well, meaning it's good to do one last visual confirmation of each page in the book. but other than this set of wrinkles, _most_ of the paragraph-breaks (or non-breaks) that occur on a page-break can be auto-detected... 19. check the paragraph-breaks that occur on page-breaks. this turned up a number of hits, which i've appended... in the first group, we find 2 lines where the paragraph-terminating period was misrecognized as a comma, so we will correct those two. in the second group, we have 7 lines that were clearly _false-alarms_, since the previous page was not paragraph-terminated. i fixed them. next, by searching for lines that _are_ paragraph-terminated, but which are also _long_lines_, we find and fix another 8 false-alarms. as exceptions to exceptions, not _all_ the long lines are false-alarms; the last group shows 2 long lines that did indeed end the paragraph, even though examination of the preceding page won't tell you that, you can only see it by going to the following page where you observe that yes, indeed, there is a new paragraph-start at the top of the page, as indicated by indentation of the paragraph. (there might have been more than these two examples, but i didn't think to save earlier ones.) 17 more lines corrected, for a grand total of 186, on 19 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird *** a period-termination was misrecognized as a comma > radiant content settled upon her, > > http://z-m-l.com/go/mount/mountp142.html > > "Thank you," she told him seriously; "it will > "We've never been storekeepers," > > http://z-m-l.com/go/mount/mountp321.html > > "Never kept much of anything, have you, any of *** non-termination, incorrectly coded as paragraph-breaks... > it toward him. "In Greenstream," he continued, > http://z-m-l.com/go/mount/mountp121.html > "men don't like me, they are afraid of me; but the > "Are you going to the camp meeting on South > http://z-m-l.com/go/mount/mountp179.html > Fork next week?" she demanded. "I have never > a man will murder me," she replied in level tones; > http://z-m-l.com/go/mount/mountp210.html > "perhaps I'll get a thrill from that." Her voice > "You could study a life on women," Rutherford > http://z-m-l.com/go/mount/mountp225.html > Berry pronounced, "and never come to any satisfaction. > "And you go right around, Alec," his wife added, > http://z-m-l.com/go/mount/mountp301.html > "and twist the head off that dominicker chicken. > shade of minute, variously-colored silks the effigy of > http://z-m-l.com/go/mount/mountp304.html > Mrs. Hollidew dead. Undisturbed in the film of > "I say I wanted to see you," the voice persisted; > http://z-m-l.com/go/mount/mountp366.html > "it's Edgar Crandall. You'll take pleasure from *** not a paragraph-break, just a long mid-paragraph line > She started toward him in an excess of tender pity. > http://z-m-l.com/go/mount/mountp143.html > "Do you care as much as that?" She laid her > "You're getting on the money now, are you? > http://z-m-l.com/go/mount/mountp172.html > Going to start that song? That'll come natural to > He entered the room. It was, he divined, hers. > http://z-m-l.com/go/mount/mountp176.html > His foot struck against a chair, and his hand caught > a lithe, wicked hatred in any other human being. > http://z-m-l.com/go/mount/mountp209.html > "You are a gentle object," he satirized her, loosening > she sought his lips. "Soon again," she murmured. > http://z-m-l.com/go/mount/mountp214.html > "Don't desert me; I am entirely alone except for > The postmaster laid it on top of the glass case. > http://z-m-l.com/go/mount/mountp238.html > "The jobber sent it up by accident," he explained; > hanging limply, breathing in sharp inspirations. > http://z-m-l.com/go/mount/mountp260.html > She gazed about at the valley, the half-distant maple > endeavor to instil into her some of his warmth. > http://z-m-l.com/go/mount/mountp276.html > He gazed at her for a moment, at the shadows like *** long lines, but ones that were actually paragraph-breaks. > pale orange paper, pinched between withered fingers. > > http://z-m-l.com/go/mount/mountp329.html > > Suddenly he was in a hurry to get away; he drew > was the power, the unconquerable godhead, of gold. > > http://z-m-l.com/go/mount/mountp339.html > > The thought of the storekeeper was lost in the ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080721/4226b0f6/attachment.htm From Bowerbird at aol.com Mon Jul 21 09:55:57 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Jul 2008 12:55:57 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 020 Message-ID: we shouldn't have a line that starts with a comma, should we? 20. search for all lines that start with a comma. > philosophy underlying them, any ruthless strength, > , escaped him entirely. They appealed solely to him one of them, fixed. (a speck on the page caused the glitch.) 1 more line corrected, for a grand total of 187, on 20 routines... i'll be back tomorrow with the next tip in this series... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080721/322a0f4e/attachment.htm From hart at pglaf.org Mon Jul 21 10:05:37 2008 From: hart at pglaf.org (Michael Hart) Date: Mon, 21 Jul 2008 10:05:37 -0700 (PDT) Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 020 In-Reply-To: References: Message-ID: I'm still hoping you will send me a list of all 20. Thanks!!! Michael From hart at pglaf.org Mon Jul 21 10:39:40 2008 From: hart at pglaf.org (Michael Hart) Date: Mon, 21 Jul 2008 10:39:40 -0700 (PDT) Subject: [gutvol-d] submitting a text In-Reply-To: <688269960807181452q78a8cf5al54a2372d61ae05a8@mail.gmail.com> References: <688269960807181452q78a8cf5al54a2372d61ae05a8@mail.gmail.com> Message-ID: When in doubt, you can always send them to me. Michael S. Hart Founder Project Gutenberg hart at pglaf.org On Fri, 18 Jul 2008, Russell Bell wrote: > I downloaded Bartleby's copies of 'Maxims for Revolutionists' and > 'Revolutionists' Handbook and Pocket Guide', > formatted them in accord with the rules, gutchecked them, made iso8859 > & ASCII copies. Downloaded a > copy of the image of an original edition from googlebooks for > comparison. Now where do I send them? The > FAQ tells me to e-mail them to any member of the posting team but > gives no addresses for any of them. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Mon Jul 21 15:46:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Jul 2008 18:46:42 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 020 Message-ID: michael said: > I'm still hoping you will send me a list of all 20. for the founder, i can do that, yes. we will end up with more than 20. for everyone else, collecting them is a matter of having the dedication. i did the work to define the set for this particular "mountain blood" test; that's the hard part; the mere act of collecting my posts is the easy part. but there's little reason for anyone to collect these, unless they plan on programming their own tool... i'll be releasing a version of my tool, which incorporates these routines (and more), which anyone can use to clean up an o.c.r. text, so use that. -bowerbird p.s. if anyone else does want to capture all of the posts in this series, i'd suggest the july digest in the archives... p.p.s. for more routines, you can check out gutcheck and guiguts: > http://gutcheck.sourceforge.net/index.html > http://home.comcast.net/~thundergnat/guiguts.html ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080721/9a7335ba/attachment.htm From Bowerbird at aol.com Mon Jul 21 20:14:19 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Jul 2008 23:14:19 EDT Subject: [gutvol-d] a few more references Message-ID: i said: > p.p.s.? for more routines, you can check out gutcheck and guiguts: >?? http://gutcheck.sourceforge.net/index.html >?? http://home.comcast.net/~thundergnat/guiguts.html you can also look here, on the distributed proofreader forums: > http://www.pgdp.net/phpBB2/viewtopic.php?p=331320 > http://www.pgdp.net/phpBB2/viewtopic.php?p=332044 as you can see, it was over a year ago i was bringing this up directly to d.p., back when they "allowed" me to do it directly -- and roger frank was "working on it" even way back then -- with various people chipping in offering several good routines, but somehow a whole year has gone by with nothing being done. actually, i've been "bringing this up" to d.p. for many years now, with "nothing being done" being precisely what was (not) done. come back in a year from now and see if they've done anything... -bowerbird p.s. by the way, it's _not_ the case that "more" routines is "better", because you start to run into the "false alarm" problem before long. but, you know, have at it with all the routines... ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080721/ba4bd4bd/attachment.htm From schultzk at uni-trier.de Mon Jul 21 23:14:15 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Tue, 22 Jul 2008 08:14:15 +0200 Subject: [gutvol-d] a few more references In-Reply-To: References: Message-ID: <1539A2F1-DC7E-413C-B203-930D442D5017@uni-trier.de> Hi Bowerbird, Am 22.07.2008 um 05:14 schrieb Bowerbird at aol.com: > > p.s. by the way, it's _not_ the case that "more" routines is > "better", > because you start to run into the "false alarm" problem before long. > but, you know, have at it with all the routines... As you mentioned before the routines you have mentioned are the easy part !! The hard part is making the improved mouse trap so that you do not have those false alarms! Furthermore, if one can not automatically distinguish between a true case and false alarm then this is a true case for human intervention and should be flagged. True, it is annoying and it slows down the process, yet it adds to the qualitiy of the result. regards Keith. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080722/c9d9b87b/attachment.htm From Bowerbird at aol.com Tue Jul 22 01:24:25 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Jul 2008 04:24:25 EDT Subject: [gutvol-d] a few more references Message-ID: keith said: > if one can not automatically distinguish between > a true case and false alarm then this is a true case > for human intervention and should be flagged. well, actually, aside from the spacey-quote corrections, _all_ of these fixes are done in a human-mediated way. the tool will take you to each glitch, show you the scan, and position the cursor in the field for you to do an edit. so each decision on these -- to edit or not -- is _informed_; you have examined text and scan, so you know the score... these glitches are treated as "false alarms" until confirmed. (even though the number of _real_ false alarms is very low.) and even the auto-spacey-quote corrections are _verified_, by colorizing the quotes so you can assess their correctness. for the double-quotes, you will step through each page, but -- as a demonstration of what i mean -- here's a _list_ of the colorized passages that were inside _single-quotes_: > http://z-m-l.com/go/mount/mount-singlequotes.png (single-quotes are actually much more difficult to check, because you need to control for apostrophe contractions.) this colorized verification ensures auto-changes are right... *** essentially, what makes this clean-up so bloody efficient is that the _computer_ is _finding_ the errors for you, and then making it _as_simple_as_possible_ for you to fix 'em. i can do this by locating the error-locating routines inside the tool that juxtaposes a text-editor with a scan-viewer, such that all three of these elements are working together. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080722/ed2632db/attachment.htm From hart at pglaf.org Tue Jul 22 08:45:13 2008 From: hart at pglaf.org (Michael Hart) Date: Tue, 22 Jul 2008 08:45:13 -0700 (PDT) Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 020 In-Reply-To: References: Message-ID: On Mon, 21 Jul 2008, Bowerbird at aol.com wrote: > michael said: >> I'm still hoping you will send me a list of all 20. > > for the founder, i can do that, yes. we will end up with more > than 20. Any idea what the expected total might be? > for everyone else, collecting them is a matter of having the > dedication. i did the work to define the set for this particular > "mountain blood" test; that's the hard part; the mere act of > collecting my posts is the easy part. Does that mean you would mind if I passed them on? > but there's little reason for anyone to collect these, unless they > plan on programming their own tool... _I_ collect all possible error hunting tools. . .period. I can't speak for others, but I'm willing to share. > i'll be releasing a version of my tool, which incorporates these > routines (and more), which anyone can use to clean up an o.c.r. > text, so use that. And it will run on what OS's? Thanks!!! Michael > > -bowerbird > > p.s. if anyone else does want to capture all of the posts in this > series, i'd suggest the july digest in the archives... > > p.p.s. for more routines, you can check out gutcheck and guiguts: >> http://gutcheck.sourceforge.net/index.html >> http://home.comcast.net/~thundergnat/guiguts.html > > > ************** > Get > fantasy football with free live scoring. Sign up for FanHouse Fantasy Football > today. > (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > From Bowerbird at aol.com Tue Jul 22 09:05:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Jul 2008 12:05:42 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 021 Message-ID: 21. search for all lines with 11, or small-l-capital-i, or capital-i-small-l... > -----File: 011.png > [11] > -----File: 110.png > [110] > -----File: 111.png > [111] > -----File: 112.png > [112] > -----File: 113.png > [113] > -----File: 114.png > [114] > -----File: 115.png > [115] > -----File: 116.png > [116] > -----File: 117.png > [117] > -----File: 118.png > [118] > -----File: 119.png > [119] > -----File: 211.png > [211] > [3011 > -----File: 311.png > [311] > South Fork; Nickles'11 do it and glad. It will wipe > Ill (17) > Ill (164) > Ill (289) well, first, the "[3011" pagenumber was corrected to read "[301]". the 3 versions of "chapter iii" were corrected, misrecognized as "ill". and the "'11" which should have been an "'ll" was also corrected... 5 more lines corrected, for a grand total of 192, on 21 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080722/356d955f/attachment.htm From Bowerbird at aol.com Tue Jul 22 09:15:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Jul 2008 12:15:03 EDT Subject: [gutvol-d] one tenth of one percent Message-ID: good news! planet strappers has finished with its _10th_iteration_ through the "perpetual proofing" experiment. but, um... no, sorry, this iteration did not catch the error on page 33. maybe the _11th_iteration_ will catch it... more data from this iteration later... *** also, "mountain blood" -- a test which had parallel p1 proofings -- has now finished with p2. this is the book which i've been treating with my "how to do preprocessing clean-up" series, so it will be fun to see how my clean-up compares with _three_ rounds of proofing... stay tuned for that... *** and yes, i _do_ know that you're probably sick of the data from these d.p. "experiments", especially since they all show the same old thing, which is that the human proofers are doing an outstanding job, while the d.p. bureaucracy and workflow are immensely stupid and wasteful. believe me, i'm as tired of the minutiae of mistakes as you. likely more. but let's put this into perspective, ok? i've probably analyzed the data from a _dozen_ of these experiments... distributed proofreaders claims it has now digitized over 13,000 books. so i've analyzed less than _one_tenth_of_one_percent_ of their books. if we extrapolate, then where i have pointed to _thousands_ of changes in each book that were needlessly imposed on the volunteer proofers, what we realize is that -- over the course of their entire output so far -- d.p. has forced its proofers to make _millions_ of unnecessary changes... millions of unnecessary changes... chew on that fat... any volunteer with half a brain can easily and clearly see the inefficiencies. changing the same scanno on page after page after page, when they know it would be so much faster and easier to make the change _once_, globally. what a waste. how stupid. the inefficiencies, vast and deep, are staggering. how many bright people have walked away, refusing to be abused like that? i don't know. but it makes me very sad to think about it... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080722/96064af0/attachment-0001.htm From hart at pglaf.org Tue Jul 22 09:33:45 2008 From: hart at pglaf.org (Michael Hart) Date: Tue, 22 Jul 2008 09:33:45 -0700 (PDT) Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 021 In-Reply-To: References: Message-ID: how about searching for all lines with l1 ??? damn! those look so much alike I'm wondering if that IS what you wrote below, small-l-numeral-1??? the font i am using makes them look nearly exactly the same, but I think one is slightly dimmer than the other. lllllllllll vs 11111111111 Hmmm, even THEY don't all look the same. Oh, well. . .I guess it would be worth spelling them out, eh? mh On Tue, 22 Jul 2008, Bowerbird at aol.com wrote: > 21. search for all lines with 11, or small-l-capital-i, or > capital-i-small-l... > >> -----File: 011.png >> [11] >> -----File: 110.png >> [110] >> -----File: 111.png >> [111] >> -----File: 112.png >> [112] >> -----File: 113.png >> [113] >> -----File: 114.png >> [114] >> -----File: 115.png >> [115] >> -----File: 116.png >> [116] >> -----File: 117.png >> [117] >> -----File: 118.png >> [118] >> -----File: 119.png >> [119] >> -----File: 211.png >> [211] >> [3011 >> -----File: 311.png >> [311] >> South Fork; Nickles'11 do it and glad. It will wipe > >> Ill (17) >> Ill (164) >> Ill (289) > > well, first, the "[3011" pagenumber was corrected to read "[301]". > the 3 versions of "chapter iii" were corrected, misrecognized as "ill". > and the "'11" which should have been an "'ll" was also corrected... > > 5 more lines corrected, for a grand total of 192, on 21 routines... > > i'll be back tomorrow with the next suggestion in this series... > > -bowerbird > > > > ************** > Get fantasy football with free live scoring. Sign up for > FanHouse Fantasy Football today. > > (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > From Bowerbird at aol.com Tue Jul 22 10:24:50 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Jul 2008 13:24:50 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 021 Message-ID: michael said: > how about searching for all lines with l1 ??? we already did a search where we looked for a letter-number combo, or a number-letter combo. many of these searches are redundant... one of the next searches i'll recommend is a search for any line which has a number in it (which will disregard the pagenumbers, of course). in a sense, that one general search could have been done instead of all of these more-specific searches. however, i'm listing the hits of all of these routines when run against the _original_o.c.r._, as if each was the _first_ such routine to be run. that's not how it actually happens when you do this in the real world. when you run the first routine, you _fix_ the errors the routine flags; and that means they won't come up when the later routines are run... so by running the routine that finds the "11" for "ll" misrecognitions, and fixing those first, you eliminate them from being hits for the later "any number" routine. this gives you a good focus when you're handling the specific routines -- you're looking at the same type of error, so the fixes are the same -- and means that the general routines (where you have to work harder to figure out "what is the nature of the error here?") return fewer hits. *** and, in a larger sense, that's even why we do such search routines first. many -- if not most -- of these glitches would be flagged with generic _spellcheck_, so we could just spellcheck and not bother with routines. but these routines give us a _focus_ that a general spellcheck does not, and that focus makes us more efficient. when we're done with all these, we will run a regular spellcheck, but by that time the text will be refined. *** > Oh, well. . .I guess it would be worth spelling them out, eh? this routine searched for "11" -- the number after 10, and it searched for small-l-capital-i (small ell and capital eye), and it searched for capital-i-small-l (capital eye and small ell). three errors came up on capital-i-small-l. this is a common glitch, where the "iii" (in all-capitals) that is the roman-numeral for "three" -- as in "chapter 3" -- is (mis)recognized as the word "ill" (capitalized). it's because the o.c.r. is trying to make it _into_ a real word -- ill -- which we can tell because the same type of glitch does _not_ occur on other instances with three capital-i in a row, such as with "xviii". this is why earlier i had a search for a capital-i followed by a bunch of letters, including l (i.e., ell). it's not that such words are _impossible_ -- or even _unknown_ or even _rare_ -- just consider "illustration" -- but it's worth those false-alarms to catch these subtle misrecognitions. when you've immersed yourself in this data, you can make judgments about the cost-benefit ratio of those false-alarms to those subtle hits... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080722/6ab271a3/attachment.htm From Bowerbird at aol.com Tue Jul 22 10:34:30 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Jul 2008 13:34:30 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 020 Message-ID: michael said: > Any idea what the expected total might be? 25-30 routines; then we resort to the generic spellcheck. some of the routines need to control for _names_ as well, so there is one routine that will _gather_up_ those names. > Does that mean you would mind if I passed them on? do whatever you like with them. > And it will run on what OS's? mac, p.c., and various flavors of linux. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080722/f62f8e74/attachment.htm From Bowerbird at aol.com Wed Jul 23 00:55:17 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 03:55:17 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 022 Message-ID: 22. search for lines with numbers, excluding well-formatted pagenumbers. 7 lines presented, only 1 of which was correct. 3 of the incorrect lines involved pagenumbers. > COPYRIGHT, 1915, 1919, BY > PHWTED IN THE T7NITED STATES OlfAMEEIOA > 12Q4J > "You forget, unfortunately, that.1 am forced to > P25J > *?330} > South Fork; Nickles'11 do it and glad. It will wipe 6 more lines corrected, for a grand total of 198, on 22 routines... i'll be back tomorrow with the next tip in this series... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/6dffff85/attachment.htm From rfrank at pobox.com Wed Jul 23 01:01:04 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 02:01:04 -0600 Subject: [gutvol-d] algorithm description correction Message-ID: <4886E540.9050103@pobox.com> I've had some time to look back over recent posts here on gutvol-d. I read that: the preprocessing done on "mountain blood" followed the rule that any page that started with a capital letter or a double-quotemark got a line inserted above it, the assumption being it was a new paragraph. This is not correct. The preprocessing code looks at the line ending characteristics and line length on the preceding page to decide if a blank line should be inserted. The readily available cpprep source code shows this, so I'm not sure what the misstatement above is based upon. --Roger Frank From rfrank at pobox.com Wed Jul 23 01:04:13 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 02:04:13 -0600 Subject: [gutvol-d] preprocessing definition Message-ID: <4886E5FD.20506@pobox.com> Here's another interesting statement I see in a recent post: neither of these books got good preprocessing on them -- which is why i re-did the o.c.r. -- There are different definitions of "preprocessing" here. My preprocessing code is only code that analyzes the text and makes corrections it is confident are warranted. If it's not sure, it either flags it or leaves it for the proofers. If preprocessing is defined to include a person making a decision on if a correction is needed, then to me that's proofing, not preprocessing. Doing proofing at the start of a project doesn't make it preprocessing. I have scanned and content-provided well over 300 books over at Distributed Proofreaders. With the help of a lot of good people there , those books have been proofed and formatted effectively. If I did the first round of proofing myself (what some here call preprocessing), I would not have had the time to post-process the 240 books that I've uploaded to Project Gutenberg. It's that experience and the feedback and advice of active proofers that helps me to continue to develop what I hope is useful software. --Roger Frank From rfrank at pobox.com Wed Jul 23 01:07:57 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 02:07:57 -0600 Subject: [gutvol-d] rejoining hyphens Message-ID: <4886E6DD.7050004@pobox.com> This was posted recently about preprocessing: so i won't examine the (hundreds and hundreds) of _unnecessary_changes_ that the proofers had to do (e.g., rejoining hyphenates), because if that b.s. doesn't already _stink_badly_ to you, your nose isn't working correctly. I'm not sure what that means, but it sounds very negative to me. I think it implies that the preprocessing code doesn't attempt to rejoin hyphenated words. That is not how it works at all. The code for resolving hyphenation is fairly involved. What really happens in when a hyphen appears at the end of a line separating a word pair, the software looks for the hyphenated form including the hyphen throughout the entire text completely within a line. If it's convinced the author meant it to be hyphenated, the word is brought up an the hyphen is retained. If that wasn't conclusive, it looks for the concatenated word pair without the hyphen throughout the text to try to resolve it as non-hyphenated. If that isn't conclusive, a check is made of a contemporary word list of hyphenated words. And if that isn't conclusive, the word is left hyphenated and separated. What will usually happen in that case is the proofer will mark it with -* and I'll make a final decision in post-processing. Every attempt is made to match the then-current conventions of the particular book. As the developer of cpprep, I just wanted to speak authoritatively on how the preprocessing code works since it is being repeatedly mischaracterized on gutvol-d. --Roger Frank From rfrank at pobox.com Wed Jul 23 01:40:59 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 02:40:59 -0600 Subject: [gutvol-d] banana cream program Message-ID: <4886EE9B.4050706@pobox.com> This sounds very promising: i programmed such a tool years ago -- called "banana cream" -- and i've decided that in the light of recent improvements, i will be releasing a stripped-down version of it to the public very soon... I would certainly like to see the code for the "banana cream" program, with or without the newest changes. Except for ppvtxt and ppvhtml (programs that are used by post-processing verifiers to do final checks on submitted text and HTML files before submitting them to the whitewashers), none of my programs use a UI. i could've released this (banana cream) program years ago -- and intended to -- but since there were a several d.p. people among my antagonists here on this listserve, i decided to hold it back instead. in view of their silence recently, there's no need for continued punishment... My position is that the people that read this list are serious, dedicated people who genuinely want to learn and contribute to making the process better. Holding something back for personal reasons--as if to punish misbehaving children--isn't something I would do, but that's just me. All of my source codes are readily available to study, improve, constructively criticize and I hope, for many, to use. I'm looking forward to when we see an announcement "very soon" that the "banana cream" code is available. --Roger Frank From schultzk at uni-trier.de Wed Jul 23 02:10:24 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Wed, 23 Jul 2008 11:10:24 +0200 Subject: [gutvol-d] banana cream program In-Reply-To: <4886EE9B.4050706@pobox.com> References: <4886EE9B.4050706@pobox.com> Message-ID: <03D43930-4CBF-4A3C-B8C6-1B221111F0E8@uni-trier.de> Hi All, The person invovled I hope will reconsider offering the stripped down version. A stripped down version most always has its draw backs which arises in unneeded critic. Maybe a modularized version with nice interfaces (no not GUI -- function) would be nice. helps in the integration process. regards Keith. Am 23.07.2008 um 10:40 schrieb Roger Frank: > This sounds very promising: > > i programmed such a tool years ago -- called "banana cream" -- > and i've decided that in the light of recent improvements, > i will be releasing a stripped-down version of it to the > public very soon... > > I would certainly like to see the code for the "banana cream" > program, with or without the newest changes. Except for ppvtxt > and ppvhtml (programs that are used by post-processing verifiers > to do final checks on submitted text and HTML files before submitting > them to the whitewashers), none of my programs use a UI. > > i could've released this (banana cream) program years ago -- and > intended to -- but since there were a several d.p. people among > my antagonists here on this listserve, i decided to hold it back > instead. in view of their silence recently, there's no need for > continued punishment... > > My position is that the people that read this list are serious, > dedicated people who genuinely want to learn and contribute to > making the process better. Holding something back for personal > reasons--as if to punish misbehaving children--isn't something I > would do, but that's just me. All of my source codes are readily > available to study, improve, constructively criticize and I hope, > for many, to use. > > I'm looking forward to when we see an announcement "very soon" > that the "banana cream" code is available. > > --Roger Frank > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From schultzk at uni-trier.de Wed Jul 23 02:04:39 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Wed, 23 Jul 2008 11:04:39 +0200 Subject: [gutvol-d] preprocessing definition In-Reply-To: <4886E5FD.20506@pobox.com> References: <4886E5FD.20506@pobox.com> Message-ID: <71E8D8DC-E426-4D36-A5EA-D7D95C265DAB@uni-trier.de> Hi Roger, I understand your point, but your preprocessing should be post processing of the o.c.r. The way you have describe it here I would say have the process run automatically. You won't need to interevene yourself. That is what computers are there for in the first place. regards Keith. Am 23.07.2008 um 10:04 schrieb Roger Frank: > Here's another interesting statement I see in a recent post: > > neither of these books got good preprocessing on them > -- which is why i re-did the o.c.r. -- > > There are different definitions of "preprocessing" here. My > preprocessing > code is only code that analyzes the text and makes corrections it is > confident are warranted. If it's not sure, it either flags it or > leaves > it for the proofers. If preprocessing is defined to include a person > making a decision on if a correction is needed, then to me that's > proofing, not preprocessing. Doing proofing at the start of a project > doesn't make it preprocessing. > > I have scanned and content-provided well over 300 books over at > Distributed Proofreaders. With the help of a lot of good people > there , > those books have been proofed and formatted effectively. If I did > the first round of proofing myself (what some here call > preprocessing), > I would not have had the time to post-process the 240 books that I've > uploaded to Project Gutenberg. It's that experience and the feedback > and advice of active proofers that helps me to continue to develop > what > I hope is useful software. > > --Roger Frank > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From Bowerbird at aol.com Wed Jul 23 02:37:15 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 05:37:15 EDT Subject: [gutvol-d] is this a dialog? Message-ID: ok, i guess rfrank wants to have a dialog? or maybe not, i dunno. anyway, fine, roger, if you do, and fine if you don't. if you do, then you've got lots and lots of posts to read before you get up to speed on where i'm coming from, so i'll give you time to do all that reading if you want to. and if you just want to pop up here and accuse me of "mischaracterizing" what you're doing, and then burrow back into noncommunicative mode, that's all right too... really, roger, whatever you want to do. if you want to ignore everything i've written, and just engage me in general friendly conversation, that's fine, i'll be happy to share with you exactly what's on my mind. *** i'll just sit back and wait until the dust settles and make my replies to whatever you've said. *** but let's look at your 4 posts from tonight: 1. "algorithm description correction" you say that i've mischaracterized your algorithm, and say we can look at the code to see how it works. my reply is that my observation was based on _direct_ well... observation... of the actual o.c.r. text itself, and i stand by my observation. i could post the text itself, so that everyone can see that what i said is correct... or you can post the text, roger. i checked. i'm right. *** 2. "preprocessing definition" you seem to assume i've said that _you_ should have done the "preprocessing". i've never maintained that. i've said that _someone_ should. not necessarily you. in fact, i've explicitly said that it does _not_ have to be done by the content provider, that it could be done by a regular proofer as a book-wide process that occurs _after_ "content providing" and _before_ p1 proofing... just carve out a special round -- call it "p-zero" -- and have a designated "preprocessor" do the job... it might as well be the person who will post-process, since the jobs overlap so much. indeed, what i suggest is you do many of the tasks one does in post-processing _before_ the text goes to the p1 proofers. better that way. this concept is not strange. dkretz is already doing it, and he'll tell you that it's working out very well for him. and it's not all that hard to carve out a special round. d.p. did it big time when it went from 2 to 4 rounds, and again switching from 4 to 5 when they added p3. plus there's been a shift toward making smoothreading a quasi-official round too, this one gradual and informal, but still a demonstration that rounds can be carved out. > There are different definitions of "preprocessing" here. i use my definition. consistently. and i always have. > My preprocessing code is only code that analyzes the text > and makes corrections it is confident are warranted. i've found there are _very_ few of those type of "corrections". almost every darn routine will hit on a false-alarm _sometime_. so where your "definition" ends up is to do no preprocessing... whereas _my_ definition says that preprocessing is _efficient_. sure you have to _look_at_ the changes you make, but so what? while _you_ might define such "examination" as _proofing_ rather than _preprocessing_, i define "proofing" instead as the word-by-word examination of the text compared to the scan. my _preprocessing_ doesn't involve that word-by-word mode. you only fix the glitches that the computer can find _instantly_. but if a tool finds it instantly, why make a human search for it? that's counterproductive. i mean, after your proofers have spent, oh, maybe 12 hours proofing all the pages of a book, and then i drop it in a tool and the errors pop out _instantly_, well, it just makes me kind of shake my head and laugh a bit. > I have scanned and content-provided well over 300 books > over at Distributed Proofreaders. yeah, i know. you're doing _great_ work as a content provider... and you're doing _fantastic_ work as a post-processor as well... where you -- and d.p. in general, in a long-lived shortcoming -- are coming up short is in your _preprocessing_, which sucks... which means you're wasting a _lot_ of the time of your proofers, because they must make unnecessary changes. i've documented these unnecessary changes over _many_ of your "experiments"... if you've analyzed your data anywhere near as closely as i have, please share your results and conclusions here. because i have. > I have scanned and content-provided well over 300 books > over at Distributed Proofreaders. > With the help of a lot of good people there > those books have been proofed and formatted effectively effectively? yes. the proofers rock. efficiently? no. not by a long shot. the absence of good preprocessing has wasted many resources. if you had done that, you could have finished over 600 by now... *** 3. "rejoining hyphens" it's getting late. let's talk about rejoining hyphenates later, ok? but i can assure you that i know _all_ about how it's done... :+) *** 4. "banana cream" > I would certainly like to see the code for the "banana cream" the code isn't available. only the compiled app. the code wouldn't help you much anyway, since it's basic (as in "beginners all-purpose symbolic etc.), and besides, it's the design and operation of the tool which is what you _really_ need to be interested in... > Except for ppvtxt and ppvhtml > (programs that are used by post-processing verifiers > to do final checks on submitted text and HTML files > before submitting them to the whitewashers), > none of my programs use a UI. i don't believe i've seen any programs from you that _do_ have a g.u.i., so i didn't know you even had any. i'm glad. because yeah, you really must have a g.u.i. to make it work. if you've read one of the latest posts here from me, i've said that you need to have _three_ elements all working together, the first one being the text-editor, the second a scan-viewer, and the third being the routines that find the lines with errors. of course, you also need good old "find" functionality, and auto-generated tables of contents, and lots of other stuff, but those big three are the basic heart-and-soul of the tool. until you have that, you don't really have much at all... > My position is that the people that read this list are serious, > dedicated people who genuinely want to learn and contribute > to making the process better. yeah, well, i guess you haven't been reading along for 5 years. :+) if you want to know the truth, you can always read the archives. > Holding something back for personal reasons-- > as if to punish misbehaving children-- that's exactly right, i was punishing the misbehaving children. i haven't ever said it that way, but that's a very apt description. and moreover, i was punishing the _friends_ of those children by denying them a tool that could've been very useful to them, simply because those children were misbehaving so very badly. so yes, a few bad children here meant everyone at d.p. had to go without a useful tool, for years. hey, no skin off my nose... (of course, these friends also had some culpability, because they _could_ have reined in the ones who were misbehaving so badly.) > isn't something I would do, but that's just me. it is something i would do. something i've done. and would do again. and again and again. i don't let bullies pick on me or treat me badly... i give 'em their own medicine, and i make sure my rocks hit the target. however, if you wanna be nice and friendly, i'm nice and friendly back. > All of my source codes are readily available to study, improve, > constructively criticize and I hope, for many, to use. that's nice. but usually i'm more interested in the tool than its source. i once offered to lead an effort here to create some open-source stuff, including full-on open-source replacements for my close-source tools. but i was the little red hen who was doing all the work, so i ended that. but, you know, if you want me to re-start that effort and you'll do work, i will be happy to guide you in an effort to create tools similar to mine... feel free to pick my brain. i'll tell you anything you want to know. but no, i won't hand over my code. you can buy it. but it's not free. you can even buy it and (because you'll own it) turn it open source. but i'm not gonna give it to you. besides, wouldn't help you anyway. > I'm looking forward to when we see an announcement "very soon" > that the "banana cream" code is available. the code won't be available. but the tool will be. could be right now, if i wanted it to be. if you'd like to see a copy right now, and you agree to write some criticism of it (constructive or not, doesn't matter to me) and send it as a post to this listserve, i can send you a copy right now... i got other stuff available too, some of it web-based, if you're curious... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/0a80742c/attachment-0001.htm From marcello at perathoner.de Wed Jul 23 04:54:49 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 23 Jul 2008 13:54:49 +0200 Subject: [gutvol-d] is this a dialog? In-Reply-To: References: Message-ID: <48871C09.4030501@perathoner.de> Bowerbird at aol.com wrote: > ok, i guess rfrank wants to have a dialog? > > if you do, then you've got lots and lots of posts to read > before you get up to speed on where i'm coming from, Or he could go here http://www.gnutenberg.de/bowerbird/ and get the gist in a few minutes. -- Marcello Perathoner webmaster at gutenberg.org From jayvdb at gmail.com Wed Jul 23 04:55:27 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Wed, 23 Jul 2008 21:55:27 +1000 Subject: [gutvol-d] is this a dialog? In-Reply-To: References: Message-ID: Hi, On Wed, Jul 23, 2008 at 7:37 PM, wrote: > ok, i guess rfrank wants to have a dialog? > > or maybe not, i dunno. > > anyway, fine, roger, if you do, and fine if you don't. I've have only recently started paying attention to the inner workings of Project Gutenberg, for reasons explained below, so I have very little understanding of the backstory. This level of confrontation seems very unhealthy, for all involved and the project as a whole, because my first impression on joining this list was an atmosphere that I was not expecting. It wouldnt be such a problem if the majority of discussion was more collegial and it was only the occasional spat. I've started looking at PG as a project I am heavily involved in, Wikisource, is coming to the stage where it is completing proofreading projects on a regular basis, and these texts are suitable to be pushed into PG, as Distributed Proofreaders currently does. The German Wikisource has long had a policy of rejecting contributions that not accompanied by pagescans, so they have many texts which are suitable and verified. The English and French projects are not so stringent, but are increasingly focusing on proofing based on pagescans. Here are our stats: http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics The works are available here: http://de.wikisource.org/wiki/Kategorie:Index http://fr.wikisource.org/wiki/Cat%C3%A9gorie:Index http://en.wikisource.org/wiki/Category:Index We may never be as big a contributor as the fine people at DP, but we can help. More importantly, we can compete. Even if we are not in the same league as DP, Wikisource can compete in processes, methods, etc. Efficiency, which bowerbird talks about, is orthogonal to output volume. If DP is inefficient, the most effective way of demonstrating this is build a better mouse trap. Personally, I think Wikisource has a decent mousetrap, we have developers regularly improving the software, and because it is a wiki, we have a userbase that is constantly improving the interface and social fabric. Also because it is a wiki, _anyone_ can run a bot to automate parts of the process. I use the "pywikipediabot" codebase, but there are many other frameworks that connect the coder to the wiki interface. I would like to publicly encourage bowerbird to come to Wikisource, evaluate it, and either tell us what is wrong with it, or better yet, demonstrate your code in action doing the pre-processing. Obviously your time is precious, and your input into our growth will be valued, so I will make myself available to help you ramp up quickly. If your tool cant be bolted onto the wiki framework with minimal enhancement to your tool, we can work out an interface that will suit, or I will sign an NDA and help you code it releasing all copyright to you. > if you've read one of the latest posts here from me, i've said > that you need to have _three_ elements all working together, > the first one being the text-editor, the second a scan-viewer, > and the third being the routines that find the lines with errors. and if I understand your prior posts correctly, you want a third element, the user. The user approves or rejects the suggested change? > but, you know, if you want me to re-start that effort and you'll do work, > i will be happy to guide you in an effort to create tools similar to mine... I am keen. Pick me. > you can even buy it and (because you'll own it) turn it open source. > but i'm not gonna give it to you. besides, wouldn't help you anyway. If your GUI tool can bolt onto Wikisource as the backend, and you price licenses sensibly, I know a few people who will buy it. And, I'd like to hear what sort of figure you have in mind, as a "open source bounty" might be a way to make everyone happy. -- John From hyphen at hyphenologist.co.uk Wed Jul 23 06:28:03 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Wed, 23 Jul 2008 14:28:03 +0100 Subject: [gutvol-d] banana cream program In-Reply-To: <4886EE9B.4050706@pobox.com> References: <4886EE9B.4050706@pobox.com> Message-ID: <001701c8ecc7$f47c1a00$dd744e00$@co.uk> Roger Frank wrote: > I'm looking forward to when we see an announcement "very soon" > that the "banana cream" code is available. Hope you rename it as something more descriptive of its function. I would not expect a program with such a name to help me in any way. Dave Fawthrop From marcello at perathoner.de Wed Jul 23 07:50:39 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 23 Jul 2008 16:50:39 +0200 Subject: [gutvol-d] banana cream program In-Reply-To: <4886EE9B.4050706@pobox.com> References: <4886EE9B.4050706@pobox.com> Message-ID: <4887453F.4020702@perathoner.de> Roger Frank wrote: > I'm looking forward to when we see an announcement "very soon" > that the "banana cream" code is available. We already saw that announcement 3 years ago on 27 Aug 2005: > i'll be uploading banana-cream to the web next week; > but anyone who would like to use it before then can > backchannel me for a preview copy... What we will never see is the working tool, because writing code is harder than writing announcements. BB said more than once that he will not release his code, and, believe me, from the code snippets he *did* publish, you wouldn't want to see it anyway. This is how he writes regular expressions: BB wrote on 8 Jul 2008: > it's very good to do an early search for garbage characters... > the reg-ex is something along these lines: [\&\*\<\>\\\/\|\*\{\}\_] More BB code at: http://www.gnutenberg.de/bowerbird/#reality BB wrote on 11 Jul 2008: > i could've released this program years ago -- and intended to -- > but since there were a several d.p. people among my antagonists > here on this listserve, i decided to hold it back instead. in view of > their silence recently, there's no need for continued punishment... History repeats itself. And some children never grow. Film at: http://www.gnutenberg.de/bowerbird/#toc31 -- Marcello Perathoner webmaster at gutenberg.org From rfrank at pobox.com Wed Jul 23 09:30:14 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 10:30:14 -0600 Subject: [gutvol-d] preprocessing definition In-Reply-To: <71E8D8DC-E426-4D36-A5EA-D7D95C265DAB@uni-trier.de> References: <4886E5FD.20506@pobox.com> <71E8D8DC-E426-4D36-A5EA-D7D95C265DAB@uni-trier.de> Message-ID: <48875C96.9050300@pobox.com> Schultz Keith J. wrote: > I understand your point, but your preprocessing > should be post processing of the o.c.r. > > The way you have describe it here I would say > have the process run automatically. You won't > need to interevene yourself. That is what computers > are there for in the first place. > > regards > Keith. Somehow I mis-explained it, Keith. My cpprep.rb code runs on the output of Abbyy, which I have save each page as UTF. I don't intervene or look at the pages myself. I do look at the log because of a few really rare cases that happen only once in many books. For example, it is possible that a sincle word both starts and ends with an apostrophe, and the smart quote routines will get that wrong every time if it isn't in the exceptions list. So it does run against the text from the OCR. All I get is a summary, like this excerpt from a recent book just uploaded to the proofers (In Her Own Right): 2253 start of line spaced double quote 835 double quote spacing, type 1 403 double spaces 397 end of line spaced double quote 261 double quote spacing, type 2 236 spaced punctuation 80 too many dashes 32 false paragraph break suspect 25 spaced exclamation 20 single quote spacing, type 2 15 end page asterisk added 15 start page asterisk added 12 spaced double-punctation 8 suspect start of line 5 single quote spacing, type 1 4 three dashes 3 end of line spaced punctation 2 spaced out 'll 2 numeric 11 for letters ll 2 missing space? 1 suspect l1 1 spaced out 's I hope what cpprep does is more clear from this. --Roger Frank From prosfilaes at gmail.com Wed Jul 23 09:39:34 2008 From: prosfilaes at gmail.com (David Starner) Date: Wed, 23 Jul 2008 12:39:34 -0400 Subject: [gutvol-d] is this a dialog? In-Reply-To: References: Message-ID: <6d99d1fd0807230939o1d765cd1g9ae51783488c52c5@mail.gmail.com> On Wed, Jul 23, 2008 at 7:55 AM, John Vandenberg wrote: > I've started looking at PG as a project I am heavily involved in, > Wikisource, is coming to the stage where it is completing proofreading > projects on a regular basis, and these texts are suitable to be pushed > into PG, as Distributed Proofreaders currently does. That's interesting; I was thinking that we could push many of the texts that DP does directly from DP to Wikisource from the pre-PPer stage, which would provide a public archive of the scans with a page-by-page proofed version. A filter should be able to translate DP's formatting to Wikisource's pretty easily. The two ideas aren't mutually exclusive, of course. From Bowerbird at aol.com Wed Jul 23 10:49:53 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 13:49:53 EDT Subject: [gutvol-d] is this a dialog? Message-ID: john- wow. i certainly wasn't expecting anything like _that_. what a nice surprise. i'm bowled over. i will be happy to go over and take a look at wikisource. it would be a pleasure to offer my constructive criticism to an entity smart enough to actually treasure such input. and i'd be honored to help improve your infrastructure... right from the get-go -- with the wiki structure and your ability to run bot-based error-finding routines -- i'd say you have some fantastic potential there. really fantastic. my apps are written in basic, so my code won't help you, but i'm skilled at expressing them in pseudo-code, so if you've got web-programmers to implement my routines, we'll be able to work together. and if it wasn't clear, my offline tools are cross-plat apps that are available at zero cost. (i'd guess that people still are more efficient doing this work offline, but i'm willing to let some crafty web-programmers prove me wrong...) i will respond here when i've taken a look at wikisource, just to show the kind of interaction p.g. could have had, but if we continue on for long, we can take it elsewhere... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/3553319a/attachment.htm From Bowerbird at aol.com Wed Jul 23 10:51:18 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 13:51:18 EDT Subject: [gutvol-d] banana cream program Message-ID: dave said: > Hope you rename it as something more descriptive of its function. it's a one-file application, dave. name it whatever you want. :+) -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/2aee420c/attachment.htm From Bowerbird at aol.com Wed Jul 23 11:09:39 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 14:09:39 EDT Subject: [gutvol-d] preprocessing definition Message-ID: roger said: > I don't intervene or look at the pages myself. my experience is that preprocessing can't do all of what needs to be done if the methodology does not involve a human who will look at the pages... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/3e5d3d93/attachment-0001.htm From rfrank at pobox.com Wed Jul 23 11:23:21 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 12:23:21 -0600 Subject: [gutvol-d] not a dialog Message-ID: <48877719.2060505@pobox.com> Bowerbird wrote ok, i guess rfrank wants to have a dialog? or maybe not, i dunno. Let me reply directly and say No, I am not interested in a dialog. In that same post, I read that I might "burrow back into noncommunicative mode" I am communicative to people whose opinions I value. The proofers and formatters and smoothies and yes, the management at Distributed Proofreaders are all on that list. That why I spend the time doing the documentation of my code, as best as I can given that I'd rather code than document code. Follow the "software" link at http://pgdp.rfrank.net or http://www.fadedpage.com/ppgen-main.html for the more popular programs or for the alpha versions of the documentation on the post-processing generator. I value Greg Newby's interest in preprocessing, too. That's why I posted clarification of the way the code actually works once it was mischaractized on the list. Before Greg's post, I didn't think anyone was paying that much attention, figuring that the list was mostly poisoned, that you were on the multiple kill lists, and that creative discussions were finding other more fertile places to grow. Nobody is on my kill list. I just look at everything and if it's a personal attack or a diatribe or some form of competition as to who is right, then I just click delete. There is some technically worthwhile material in your posts, though as of yet it's not news to me, so I look at it at least until the post gets personal, and it almost always does. Even in this same posting of yours I read: i'll just sit back and wait until the dust settles and make my replies to whatever you've said. *** but let's look at your 4 posts from tonight: You didn't sit back very long--perhaps just the time to type the three asterisks? Then you go on to say you observe an algorithm (that is not in the code). You say everyone can see that what i said is correct... or you can post the text, roger. i checked. i'm right. I don't need to post the text. If anyone is really interested in who is "right" here, they can look at the text as you did or look at the code or whatever they want. I simply don't care who is "right" here. To me, it's not a competition. I'll always try to do my best to make this process better, and no part of that is in trying to make anyone else look worse. People have already made their own judgement about that, it seems. Later it says > There are different definitions of "preprocessing" here. i use my definition. consistently. and i always have. That's fine, of course, as long as we understand that it's a manual process. That "P0" round is great if there are proofers available who would work in that round, and if they are sufficiently talented with regexs to find the trouble spots, or if a tool were available to take them to the trouble spots. From what I read, the tool you announced as imminent may not be coming soon--I hope I'm wrong about that. I hope you don't decide to "punish" us and not release it as you have apparently for the last two years since you last announced it. It's unfortunate that no part of your code is available in source. That keeps others from taking it and improving it. I would have liked to see whatever it is, in whatever shape it is in, on the chance that it is licensed appropriately and that I could use at least the display portion of the code. Someone posted that perhaps that premature availability would lead to unneeded criticism. I'd rather see if it's going to be useful now than wait an indefinite time for it to be polished. If it's worthwhile, I could wrap it in a MVC framework and we could all benefit from it. You've decided to withhold the source, which makes me wonder if it exists at all. Back to the main reason for this posting. You wondered if I wanted a dialog with you here. I do not. You wrote: where you ... are coming up short is in your _preprocessing_, which sucks... Actually, that's not particularly helpful or appropriate, to me. The sad part of it all is that when the list gets personal, good people who might otherwise post interesting or even exciting topics simply stay away. My observation is that since you were banned from DP that the forums there have been much more productive, that many more people are willing to share their ideas there, and that overall it's just a good place to work and contribute. It's sad that, given you feel you have good ideas, you can't find a way to present them without making it a competition and without making it personal. There is no "I" in Project Gutenberg. --Roger Frank From joshua at hutchinson.net Wed Jul 23 12:11:58 2008 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Wed, 23 Jul 2008 19:11:58 +0000 (GMT) Subject: [gutvol-d] not a dialog Message-ID: <238672185.93911216840318819.JavaMail.mail@webmail09> On Jul 23, 2008, rfrank at pobox.com wrote: I value Greg Newby's interest in preprocessing, too. That's why I posted clarification of the way the code actually works once it was mischaractized on the list. Before Greg's post, I didn't think anyone was paying that much attention, figuring that the list was mostly poisoned, that you were on the multiple kill lists, and that creative discussions were finding other more fertile places to grow. ******* Just wanted to pop in to ask if you (or anyone else) has looked into incorporating these checks into the proofing interface at DP? What I mean is, it would be useful to have a "spell-check" like utility within the DP proofing window that highlights "possible" problems and let the proofer check them and change as necessary. You could really dial up the automated check code since final say would be by the proofer (human) and you aren't as reliant on the system being 100% correct before making a change. Josh From dakretz at gmail.com Wed Jul 23 12:38:27 2008 From: dakretz at gmail.com (don kretz) Date: Wed, 23 Jul 2008 12:38:27 -0700 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 26 In-Reply-To: References: Message-ID: <627d59b80807231238o624ab523od907849b98b91d3d@mail.gmail.com> Someone suggested I should check this thread, and I'm glad I have. It looks a bit like spring-time in Prague. :) I have a couple things I can chip in, if anyone might find them useful. First, I have several dozen regular expressions I've honed over a couple years that I use to preprocess text for the ongoing Encyclopedia Britannica project. The two strongest areas it deals with are numbers (intentional or otherwise from the OCR), and double-quotes. I suspect either of the tools from rfrank or the Bird are stronger on the quotes; but my number handling goes further than what B has shown so far, anyway (based on a quick perusal of the thread archive.) In addition, Michael Lockey (vasa) and I have an alpha-test-level reimplementation of the DP process that eliminates the infamous Rounds, and supports independent tracking of text units (pages or otherwise). Since I'm already a known preprocessing fanatic on the dp site, it's intentionally friendly to that type of work. It's written (as is the pg site) in php and mysql. The main question mark I have is how to build the UI. I've been working with Adobe Flex quite a bit recently, and find it useful but not compelling. But somehow I think we need a highly portable UI that shows a. text analysis by location, b. quickly jumps through the locations of interest, c. synchronously pulls up the matching image automatically, d. provides a configurable workflow checklist with all the features of gutcheck plus new ones. I haven't seen anything yet that's not pretty clumsy and slow in one way or another. Bird, I think you've traditionally used BBasic or some such, right? I like your checklist so far; but what's your policy wrt text-interrupters like footnotes/sidenotes, tables, math expressions, etc.? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/5345da6f/attachment.htm From Bowerbird at aol.com Wed Jul 23 13:21:41 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 16:21:41 EDT Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 26 Message-ID: dakretz (a.k.a. dkretz) said: > It looks a bit like spring-time in Prague. :) smileys are good. i like smileys. :+) > First, I have several dozen regular expressions > I've honed over a couple years that I use to preprocess text > for the ongoing Encyclopedia Britannica project. i tried to locate the d.p. forum threads where you listed a ton of reg-ex. i found some of 'em, but i distinctly think that i missed some other ones. so if you could point to them, it would be good. > The two strongest areas it deals with are numbers > (intentional or otherwise from the OCR), and double-quotes. > I suspect either of the tools from rfrank or the Bird are stronger > on the quotes; but my number handling goes further than what B has > shown so far, anyway (based on a quick perusal of the thread archive.) i remember you describing those routines, yes. i haven't analyzed a lot of texts that have those types of problems, so i don't have many routines to check for them. this work is always tremendously iterative, in that one finds errors after-the-fact and try to come up with a way that one could have found them automatically. i'm sure that, besides numbers, there are many other idiosyncrasies of the encyclopedia britannica where your routines are equally unique... > Since I'm already a known preprocessing fanatic on the dp site, > it's intentionally friendly to that type of work. It's written > (as is the pg site) in php and mysql. you have been _fantastic_ as the "known preprocessing fanatic" at d.p., ever since i was banned from there. thanks for fighting the good fight! and your proofing interface is quite good. and the roundless methodology is what i have _always_ suggested, as you know. so i think you're the smartest guy over at d.p. :+) > The main question mark I have is how to build the UI. well, the crucial decision-point in that matter is "offline or online". preprocessing really is much better using a "whole-book" method, so i believe it lends itself to an _offline_ approach, not an online one. that is, it's best when it's done by _one_person_, who has the _text_ and the _scans_ on their hard-drive. a huge part of the methodology is that you're jumping all around in the book, based on the error-type that you are specifically seeking right now, so it doesn't make sense to work in the browser, not relative to the cost-benefit of working offline. and, for me, it's easier to program offline apps, because i know how. > I've been working with Adobe Flex quite a bit recently, yes, i know that. and i've been encouraged by that, precisely because it offers (promise of) a chance to bridge the offline/online distinction. > and find it useful but not compelling. really? i'd have thought your reaction would be a bit more positive. because you've started to come to the interface you'll need to build. your new interface for "twister", the one that displays the page-image automatically depending on the selection in the listbox, is the _key_. (now you just have to load the text into an editfield on the left side.) > But somehow I think we need a highly portable UI that shows > a. text analysis by location, > b. quickly jumps through the locations of interest, > c. synchronously pulls up the matching image automatically, that's the interface i've laid out before, the interface of banana cream. (but you need to add, for item (a), that the text is there, and editable.) > d. provides a configurable workflow checklist > with all the features of gutcheck plus new ones. you don't really need this. (it's largely subsumed in item (b).) > I haven't seen anything yet that's not pretty clumsy > and slow in one way or another. banana cream is neither clumsy nor slow, in any way... > Bird, I think you've traditionally used BBasic or some such, right? realbasic. http://www.realsoftware.com > I like your checklist so far; but what's your policy wrt text-interrupters > like footnotes/sidenotes, tables, math expressions, etc.? i'll explain those later. for now, just let them go in the flow... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/37400ce7/attachment.htm From Bowerbird at aol.com Wed Jul 23 14:00:28 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 17:00:28 EDT Subject: [gutvol-d] the dust Message-ID: well, roger has kicked up some dust, so i'll let that settle... in the meantime, since roger mentioned "a woman in her own right", i've modified that project -- i refuse to work with badly-named files -- and uploaded it to my site so that we can take a closer look at it: > http://z-m-l.com/go/wihorp001.html > http://z-m-l.com/go/wihor.zml i don't know if i've mentioned that i've also done "the crevice": > http://z-m-l.com/go/crvicp001.html > http://z-m-l.com/go/crvic.zml and "cabin on the prairie": > http://z-m-l.com/go/cabinp001.html > http://z-m-l.com/go/cabin.zml so we will go through the whole data-analysis exercise on these books. and, of course, finish up the books that we've already discussed so far... but i _swear_, i'm not taking on any more d.p. "experiments" after these. i have demonstrated -- clearly and unequivocally -- that i am correct when i praise the work of the proofers and condemn the d.p. workflow, and there's no sense in continuing to prove something so transparent... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/cb2efaa8/attachment.htm From walter.van.holst at xs4all.nl Wed Jul 23 14:13:28 2008 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Wed, 23 Jul 2008 23:13:28 +0200 Subject: [gutvol-d] not a dialog In-Reply-To: <238672185.93911216840318819.JavaMail.mail@webmail09> References: <238672185.93911216840318819.JavaMail.mail@webmail09> Message-ID: <48879EF8.3010104@xs4all.nl> Joshua Hutchinson wrote: > Just wanted to pop in to ask if you (or anyone else) has looked into > incorporating these checks into the proofing interface at DP? > > What I mean is, it would be useful to have a "spell-check" like > utility within the DP proofing window that highlights "possible" > problems and let the proofer check them and change as necessary. You > could really dial up the automated check code since final say would > be by the proofer (human) and you aren't as reliant on the system > being 100% correct before making a change. I gathered from the DP fora that Jeroen Hellingman is working on a 'heat map' that colours such potential problems. But since he is on this list as well, he may be able to fill you in on the details. Regards, Walter From Bowerbird at aol.com Wed Jul 23 14:42:39 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 17:42:39 EDT Subject: [gutvol-d] not a dialog Message-ID: gee, interesting to see -- because walter quoted his post -- that josh has made a good suggestion. (josh is one of those "bad children" who has been relegated to my "spam" folder".) oh, wait, that's the exact same suggestion i made over at d.p., when they were developing wordcheck. before i got banned... these guys throw mud at me for saying stuff, and then turn around and say the same thing. it's humorous to me... but, no, walter, the "heat map" idea that jeroen is working on -- an idea that he lifted directly from my post on this list -- is not intended to be integrated in the proofing environment. and the problem with using _colors_ is that you can't use them in a web-based textfield -- at least not in the past, although new w.y.s.i.w.y.g. textfields might allow you to do it now, but d.p. hasn't seemed willing to do the work to incorporate those. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/2a97292c/attachment-0001.htm From Bowerbird at aol.com Wed Jul 23 14:47:18 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 17:47:18 EDT Subject: [gutvol-d] woman in her own right -- 001 Message-ID: ok, let's take a first look at "woman in her own right"... right off, a search for garbage characters reveals 55 lines, appended... just as a sign of _respect_ for the proofers, i would not send them lines that included "<" or "+" or "^" in them. just a sign of respect. i want them to spend their time finding stuff that's _hard_ to find... further, as with "mountain blood", the runheads and pagenumbers are in the text. what's up with that? d.p. policy is to remove them, and it's something the computer can do quite easily, so why have the _proofers_ spend their time and energy doing it instead? that's b.s. of course, since i don't think the runheads and pagenumbers should be removed, not at this time, i do it _automatically_ at a later stage, i'm _glad_ that i can get the a copy of the text that still includes them. but my heavens, why make the proofers do the deletion grunt-work? the response might be that "this is a newcomers-only project, so we leave the runheads in so newcomers learn to delete them." sorry, i don't buy that. if you have the machine delete them (and, of course, check that the operation was done correctly), there is no need for anyone to "learn" they need to be deleted. so this is just another waste of the proofers time and energy. moreover, it creates a _diff_ for each page, so -- when you look back -- you have no way of easily discerning how clean the project really was until you do a _second_ round of proofing, and _then_ find all no-diffs. a meaningless diff on every page can mask important information... -bowerbird > VIII.--STOLEN.....^................................ 120 > ~AND STEPPED TWO^pUNDHED AND FIFTY PACES....... 112 > "Royster & Axtell have been thrown into bankruptcy. > of it with Royster & Axtell, who knows?" > "Well, it's come," he remarked: ~" Royster & > ass----(U+00BB) > "Tell me of Royster & Axtell," he said. > sudden resolve only the failure of Royster & > "I'll speak to Fra^ois," said Macloud, arising. > "I see Royster & Axtell went up to-day. I > again,--an ROYSTER & AXTELL FAIL! > & Axtell failure," and, with that, he would pass > languid: ~" Been away, somewhere, haven't you ? > He took the night's express on the N. Y., P. & > "Colonel Duval is dead, however," she added^- > Tery satisfactory, indeed. And he was a competent (U+201E) > "Sut'n'y, seh," returned the dark}'. "Dat's > if Gaspard, his particular waiter, missed him ? > / see you not again, Farewell. I am, sir, with > "Y'r humb'l # obed't Serv'nt > Croyden nodded. Then proceeded^ Urith. much ap-* > CONFIDENCE AND SCRUTI^S > "Your recent experience with Royster & Axtell > the Duvals didn't keep an eye on Greenberry Point ? > persisted. "Has Royster & AxtelPs failure anything > "They're safe-- "I'm glad to make your acquaint---->" began > self----(U+00BB) > & Axtell's loan," she said. "Oh, don't be alarmed! > spent, by his own fireside, alone! Alone! Alone ] > "You are determined ?--Very well, then, come > Once upon a time--->--" and laughed, softly. > possession recently, you, with two companions^ > "But you're not quite sure ?--oh! modest man!" > "Nothing! "s^id Croyden. "You're a good > "Humph! Blaxham & Company! "he grunted. > the Bonds and the stock of Royster $ Axtell, > & Company bought them at the public sale." > "I could refuse to sell unless Blaxham & Company > Macl^ud observed; ~" though, it's a pity to tilt at > moment, will you ?--you're hipped on it!" > "Than your Southern ancestors ?--isn't that > "/ sent him! I don't know the man." > \ > else--won't you tell me where you are? (/ don't > will be: ~' Come over and see us, won't you ?'" > millionaire. We've got >our share of fools, but we > \ > 280 IN HER OWN RIGHT / > (U+00AB)No!~----" > THE LONE HOUSE BY THE BAY (U+00A3)87 > \permanent residence." > \ > the Treasure--/ have lifted the Iron box, from the ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/643af6dc/attachment.htm From rfrank at pobox.com Wed Jul 23 15:02:28 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 16:02:28 -0600 Subject: [gutvol-d] not a dialog In-Reply-To: <238672185.93911216840318819.JavaMail.mail@webmail09> References: <238672185.93911216840318819.JavaMail.mail@webmail09> Message-ID: <4887AA74.6000601@pobox.com> Joshua Hutchinson wrote: | Just wanted to pop in to ask if you (or anyone else) has | looked into incorporating these checks into the proofing | interface at DP? That would be a big boost to productivity. The difficulty for me is that I'm comfortable with Ruby and Perl but uncomfortable with PHP, and I think that's an important deficiency for anyone wanting to integrate it at DP. That's why for me it's a standalone utility, like guiprep, only written in Ruby--it's just my limitation in being able to put it inside a wrapper with something stronger than a textbox widget. If I could find the equivalent of guiguts' built in editor/presentation manager, only written in Ruby, I would certainly use it. That would at least make it interactive in a "proofing round 0" sense. So bottom line, for me the answer is that it's only a "I wish I was smart enough to do that" kind of thing. As a proofer myself at DP, I agree it would be a big win. --Roger Frank From rfrank at pobox.com Wed Jul 23 15:16:45 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 16:16:45 -0600 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 26 In-Reply-To: <627d59b80807231238o624ab523od907849b98b91d3d@mail.gmail.com> References: <627d59b80807231238o624ab523od907849b98b91d3d@mail.gmail.com> Message-ID: <4887ADCD.8080508@pobox.com> don kretz wrote: > First, I have several dozen regular expressions I've honed over a couple > years that I use to preprocess text for the ongoing Encyclopedia > Britannica project. That sounded familiar. I baked something specific you suggested into the post-processing verification code (ppvtxt.pl) back on 3/19/2007. The changelog shows I incorporated special consideration for &c "as in EB articles." That said, I'm pretty sure I haven't seen the complete list of several dozen and I sure would like to. How can that happen? --Roger Frank From dakretz at gmail.com Wed Jul 23 15:18:23 2008 From: dakretz at gmail.com (don kretz) Date: Wed, 23 Jul 2008 15:18:23 -0700 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 28 In-Reply-To: References: Message-ID: <627d59b80807231518u65474dbey521f4858e3fc19fd@mail.gmail.com> > > > From: Joshua Hutchinson > To: gutvol-d at lists.pglaf.org > Date: Wed, 23 Jul 2008 19:11:58 +0000 (GMT) > Subject: Re: [gutvol-d] not a dialog > > > Just wanted to pop in to ask if you (or anyone else) has looked into > incorporating these checks into the proofing interface at DP? > > What I mean is, it would be useful to have a "spell-check" like utility > within the DP proofing window that highlights "possible" problems and let > the proofer check them and change as necessary. You could really dial up > the automated check code since final say would be by the proofer (human) and > you aren't as reliant on the system being 100% correct before making a > change. > > Josh > > You may remember that I implemented a new proofing interface a year or two ago, which provided a "preview" mode showing real italics, etc. That has since added a quote-matching display, and a punctuation reasonability-checker. They may still be on the dev server - I haven't checked for a long time. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/718d13d8/attachment.htm From Bowerbird at aol.com Wed Jul 23 15:24:54 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 18:24:54 EDT Subject: [gutvol-d] not a dialog Message-ID: roger said: > Let me reply directly and say > No, I am not interested in a dialog. fine. no problem. > Before Greg's post, I didn't think anyone was > paying that much attention, figuring that > the list was mostly poisoned, that you were > on the multiple kill lists, and that creative discussions > were finding other more fertile places to grow. when i talked about "the dust settling", it was because i anticipated this reply from you, roger... this is the post where you kick up dust. and once that dust has settled, i'll still be here, examining things... in the meantime, for someone who doesn't want a dialog, you sure have brought up a lot of issues. so while i wait for the dust to settle, and resume with my monolog mode, let me tackle this newest crop of 'em... *** the first issue seems to be about the blank lines at the top of some pages -- and not other -- in the "mountain blood" book. i've posted the o.c.r. text. people can find it here: > http://z-m-l.com/go/mount/mount.ocr.txt as it shows, clearly, i was right when i described what pages _do_ and _do_not_ have a blank line at the top of them... i've also appended a list of the pages that _do_ have a blank line -- it's all the pages that start with a capital letter or a double-quote -- and a list of the pages that _do_not_ have a blank line at the top -- it's all the pages that start with a lower-case letter or a dash -- which is _exactly_ what i said. i have already listed the pages where this blank line was incorrect, so i won't bother to repeat that. the list of pages that _do_ have a blank line is here: > http://z-m-l.com/go/mount/mount-roger_is_wrong-upper.txt the list of pages that _do_not_ have a blank line is here: > http://z-m-l.com/go/mount/mount-roger_is_wrong-lower.txt so, if people look at what i said, and look at the facts, i was right. > I simply don't care who is "right" here. well, gee, roger, i don't know what to say in response to that. if i say something, and i'm right, and you say i "mischaracterized" what was done, when i actually didn't, then perhaps it is _best_ that you "simply don't care" who is "right" and who is "wrong"... stick your head in the sand. refuse to look at the actual data. i, on the other hand, _do_ care what is right and who is wrong -- not so much on this particular question but in questions in general -- because i want to know what the facts are, and what the truth is, because a general ignorance of _the_truth_ doesn't do me much good. (and an outright rejection of it -- as you've done here -- is dangerous.) i care very deeply about which _position_ is "right", because i want to _adopt_ the right position, because it's silly to cling to the wrong one. so i don't put my head in the sand. i look at the actual data... > That "P0" round is great if there are proofers available who > would work in that round if there are volunteers who will make changes individually, over and over and over again on page after page after page, i am _quite_sure_ there are volunteers who would _love_ to have the power to make _global_corrections_ in one fell swoop, since it's a lot more efficient. and a sense of agency and efficacy are _important_ to volunteers... > and if they are sufficiently talented with regexs > to find the trouble spots they don't need to be talented with reg-ex. they just need a tool. > or if a tool were available to take them to the trouble spots. right. which is what i've been saying for about 5 years now. welcome to the party. > From what I read, the tool you announced as imminent > may not be coming soon--I hope I'm wrong about that who told you that? i told you that you could have it right now, if you wanted it right now, all you had to do was ask me for it and agree to write up a report on it and send it to this listserve. so, do you want to see it? or not? > It's unfortunate that no part of your code is available in source. yet another case where you're not attending to what's important... here's how you set up an editable field using the realbasic compiler: you drag such a field onto the window. boom, you've got an editfield. it's a styled editfield, meaning you can have italicized and/or bold text. colors, and even some other stuff, like alignment and super/subscripts. plus, of course, the normal stuff, like choice of font and fontsize... does that help you set up an editable field in perl or ruby or java? i wouldn't think so. or, you place a "canvas" using realbasic by dragging it onto the window. then you load the image of choice into the canvas. does that knowledge help tell you how you would do the same thing in perl or ruby or java? i wouldn't think so. there is some code that would help you. for instance, here's how i split the overall book text-file into pages, for a page-based display: > pages=split(book," {{") and here's how i split it into paragraphs, to do that kind of analysis: > paragraphs=split(book,chr(10)+chr(10)) that's pretty close to the command that you would use, in perl anyway. so i guess, in some sense, it would give you an advantage to see that... but, really, knowing how to do a "split" is something you already know. isn't it? > Someone posted that perhaps that premature availability would > lead to unneeded criticism. personally, i ain't afraid of criticism. like any software that you build for yourself, there's kinks in this thing. but i expect that another programmer like you cuts some slack for that. and even if you don't, who cares? if someone decides not to even look at my program because of the criticism you have leveled against it -- whether it was correct or not -- then that's their loss. no skin off my nose. > You've decided to withhold the source, > which makes me wonder if it exists at all. you know, roger, that just makes you sound stupid. i mean, really stupid. i told you point blank that if you wanted to see the app, i'd send it to you. so, if you "wonder if it exists at all", why don't you just ask to see a copy? i just can't fathom the stupidity of that. perhaps you think that, instead of writing the source-code, i created the app by waving my hands in the air magically? a good thing you don't want a dialog, because you're not holding up your end of the bargain anyway, not when you say stupid stuff like this. > Actually, that's not particularly helpful or appropriate, to me. well, gee, roger, i'm up to item #23 on a list of routines that could have been used in a good preprocessing methodology that would have found roughly 150 lines that contained errors in the "mountain blood" book, errors that could have been quickly and easily corrected before the text went in front of any proofers. if that's not helpful or appropriate to you, then i don't really know what more i can do to explain it to you... > My observation is that since you were banned from DP that > the forums there have been much more productive, they returned to the groupthink that prevailed before i was there, yes. if you like that type of thing, fine... but when the future looks back on that archive, it will tell you that my posts there were the most insightful ones in the entire bunch. and they will laugh at you for your stupidity in drumming me out... (there you go, marcello, another juicy quote for your fansite.) > It's sad that, given you feel you have good ideas, you can't find > a way to present them without making it a competition and > without making it personal. There is no "I" in Project Gutenberg. oh please... you attack me, personally, and then say that _i_ am the one who is making it "personal"? and you think people don't see through that ploy? i've written dozens of posts here where i examined your work, and never cast anything personal on you... *** anyway, like i said, whatever you want. if you don't want a dialog, then i'll be happy to return to the monolog i was having before... -bowerbird these are the pages that had a blank line at the top of them. i believe they all start with a capital letter or a double-quote. if i'm wrong about that, please let me know... 007.png // ONE 009.png // The fiery disk of the sun was just lifting above 010.png // From the vantage point of the back porches of 011.png // II 015.png // Gordon Makimmon, with secret dissatisfaction, 017.png // Ill 018.png // With a sharp flourish of his whip Gordon urged 019.png // IV 023.png // They rose steadily, crossing the roof of a 027.png // Gordon Makimmon gazed with newly-awakened 032.png // VI 035.png // "Four. They're real buck, and a topnotch article. 037.png // VII 042.png // The other consulted the book. "Two years, a 043.png // "I can give you fifty dollars," Gordon told him, 046.png // "By God!" he exclaimed, suddenly prescient, 047.png // "I'm not like that," Gordon informed him; "it's 048.png // VIII 052.png // IX 056.png // X 059.png // A preliminary drink was indispensable; and, 062.png // XI 064.png // "You got enough, all right," Em agreed. "Now, 069.png // XII 071.png // XIII 078.png // XIV 080.png // XV 081.png // "That's not correct," Simmons informed him 082.png // XVI 086.png // XVII 087.png // "Clare's dead," Gordon replied involuntarily. 088.png // XVIII 090.png // Gordon Makimmon stood at the end of the porch, 092.png // XIX 095.png // "It was certainly nice-hearted of you to come to 096.png // "I wanted to tell you," she said finally, with palpable 097.png // Then, when Zebener Hull's corn failed, 'I'll trouble 100.png // XX 101.png // The following morning found him on the front 103.png // XXI 105.png // XXII 109.png // Inan instinctive need for human support, the reassurance 110.png // XXIII 117.png // XXIV 120.png // This, he told himself complacently, was but a description 121.png // I might write there, but I'd lose time and money. 122.png // "men don't like me, they are afraid of me; but the 125.png // "What does that matter? don't you love me? 127.png // XXV 128.png // The large, suave figure of the Universalist minister, 132.png // The minister's wife inserted in the door from the 133.png // XXVI 135.png // The woman's face was bitter, her body tense. 138.png // XXVII 143.png // "Thank you," she told him seriously; "it will 144.png // "Do you care as much as that?" She laid her 147.png // TWO 149.png // In the clear glow of a lengthening twilight of 156.png // Don't disturb yourself; yours is the time for pleasures, 159.png // "Kick him again, Buck," he said; "kick him 160.png // II 161.png // Lattice, in white, with a dark shawl drawn about 164.png // Ill 165.png // "Here, General, here," Gordon commanded, and 169.png // After he had spent a limited amount, the principal 171.png // IV 173.png // Going to start that song? That'll come natural to 175.png // It was late when they returned from the farm. 176.png // The form above him leaned forward over the railing. 177.png // His foot struck against a chair, and his hand caught 178.png // VI 180.png // Fork next week?" she demanded. "I have never 181.png // He rose to leave, and she held out her hand. At 182.png // VII 183.png // Lettice was--superior; he recognized it pride-fully. 184.png // VIII 191.png // IX 192.png // "I do! Idol" He turned and left them, striding 193.png // He was fascinated by her naked, shapely arm; it was 194.png // "Why I--I got some money; that is, my wife 197.png // "Haven't you got enough at home," Buckley demanded, 199.png // X 205.png // XI 209.png // "Five years ago," he told her, "if you had tried 210.png // "You are a gentle object," he satirized her, loosening 211.png // "perhaps I'll get a thrill from that." Her voice 212.png // "Back to this wilderness," she scoffed; "any one 215.png // "Don't desert me; I am entirely alone except for 216.png // XII 218.png // "Whatever I say is good enough for Lettice," Gordon 220.png // "There's no good," he resumed, "in you and me getting 221.png // XIII 223.png // "Barnwell might cross him," she answered; and, 225.png // "You're not a camel," she truthfully observed, 226.png // Berry pronounced, "and never come to any satisfaction. 232.png // "I want a cheerful wife, one with a song to her, 234.png // XIV 235.png // "I threw the stone that hit Buck, didn't I! I 236.png // "Well, you don't have to stand and talk like I 239.png // "The jobber sent it up by accident," he explained; 243.png // XV 244.png // Hedescended, beyond the ridge, into the fact of 247.png // He was now, he realized dimly, at the crucial 249.png // XVI 250.png // "All lace and webby pink silk and ribbands underneath," 252.png // XVII 256.png // XVIII 260.png // MOUNTAIN BLOOD 261.png // She gazed about at the valley, the half-distant maple 264.png // XIX 266.png // Lettice was so young, he realized suddenly. [ 269.png // XX 271.png // It had been wonderfully comfortable in the evening 275.png // XXI 277.png // He gazed at her for a moment, at the shadows like 279.png // THREE 281.png // Lettice's death Gordon was fetching 284.png // II 286.png // Her voice, too, was like Lettice's--sweet with 289.png // Ill 291.png // IV 294.png // The couple grasped avidly at the opportunity 295.png // He was a youth of large, palpable bones, joints 296.png // VI 297.png // "Why should you?" Gordon interrupted 300.png // VII 302.png // "and twist the head off that dominicker chicken. 304.png // VIII 305.png // Mrs. Hollidew dead. Undisturbed in the film of 309.png // IX 311.png // X 312.png // Itwas the priest, Merlier. 314.png // T 315.png // "I heard her, but I'd ruther sit right where I am." 316.png // "Why, William Vibard! what an awful thing to 317.png // "They're in the stable," William Vibard answered 319.png // XII 320.png // "As an old friend," he declared, "an old Presbyterian 321.png // "I'm not aiming at anything," Gordon answered, 322.png // "Never kept much of anything, have you, any of 323.png // "I intended to come to you about that." 324.png // "You forget, unfortunately, that.1 am forced to 326.png // XIII 329.png // XIV 330.png // Suddenly he was in a hurry to get away; he drew 331.png // XV 332.png // The sales made to Valentine Simmons were, invariably, 335.png // XVI 336.png // A number of horses were already hitched along 340.png // The thought of the storekeeper was lost in the 341.png // XVII 342.png // He had not been in the house since, together with 343.png // "Get out of here!" Gordon shouted in a sudden 344.png // XVIII 347.png // General Jackson moved forward over the porcK. 349.png // XIX 351.png // XX 354.png // XXI 356.png // XXII 357.png // His wife, Lettice, how young she was smiling at 358.png // Gordon, doubting whether the horses' shoes had 361.png // XXIII 362.png // He swayed, but preserved himself from falling, 363.png // They were, Gordon knew, not half way up Buck 366.png // XXIV 367.png // "it's Edgar Crandall. You'll take pleasure from these are the pages that did not have a blank line at the top. i believe they all start with a lower-case letter (or a bracket). if i'm wrong about that, please let me know... 012.png / night before, evading such indirect query as Makimmon 013.png / lips, they had subsided into an unintelligible mutter, 014.png / or as she gravely thanked him at the end of the day's 016.png / half-heard conversation behind him; he spoke to 020.png / youth. He lounged over the road in a careless manner 021.png / this chance to the utmost with Morley's Raiders 022.png / married sister, completing the tale, lived at the opposite 024.png / forest swept down in an unbroken tide to the porch 025.png / with his mind pleasantly vacant, lulled by the monotonous 026.png / and in the end persuaded her. The stranger continued 028.png / one will know." He could not resist adding, 029.png / a sibilant exclamation, and Lattice Hollidew covered 030.png / them. The stranger consulted a small map. 031.png / hidden space, the village lay along its white highway. 033.png / and greenish, with an incomplete mustering of buttons, 034.png / girls," he pronounced coolly; "but he'll be after them 036.png / before had crippled their resources; his last Christmas 038.png / thin, sensitive nose, and a colorless mouth set in a 039.png / on the kitchen wall, where, in the watery light of a 040.png / again?" he asked solicitously; "shall I get you the 041.png / ing or benevolent sentences; these, with appropriate 044.png / out the ability to pay for a bag of Green Goose 045.png / marked precisely, over his shoulders, "the white 049.png / lying, indomitable determination, asserted itself--he 050.png / idea.--He would pay the customary substitute to 051.png / mance of his sister's courtship; the high, strident 053.png / he had limited himself in thought, but his entire 054.png / raw clay and narrow, wood sidewalks; they were, 055.png / wild. Could he afford to lose that amount from his 057.png / fat, and oddly damp and lifeless. He could see her 058.png / high shoulders, the long, pale face, the long, pale 060.png / visible money. "A dollar a go?" Jake queried, 061.png / slowly and rolled like a flash over her plastered 063.png / cern now was to get away, to take the money with 065.png / _{cr}ippling, the other. A chair fell, sliding across the 066.png / rapidly losing power. The woman threw herself on 067.png / up in front of his head; and an intolerable pain shot 068.png / lost. He clung to it; pressed his breast against it; 070.png / tination. His coat, soiled and torn, was buttoned 072.png / don knew, a sovereign and inevitable remedy for all 073.png / -Clare dangerously ill ... a question of dying, 074.png / dark, sliding water, and the mountainous wall 075.png / didn't hear ... oh, there's nothing in it if you 076.png / supper," she worried, when he had told her; "and 077.png / getically, "it will cost a heap of money; how will you 079.png / directed; "and I'll be down to see you ... yes, 083.png / and the small assemblage of merely idle or interested 084.png / nip?" he asked, in a solemn, guarded fashion. 085.png / hat drawn over his eyes, a piece of pasteboard in 089.png / ners, a subdued, red riot of the summer, the sun 091.png / but her eyes were unwavering--they held an appeal 093.png / tured her attention and interest; he had not thought 094.png / garment of stars drawn from wall to wall. There 098.png / nuto passage of a symphony; "but it's all one to me 099.png / dressed the other, "don't walk back here, don't come 102.png / pockets in search of the proof of his assertion. In 104.png / ished, leaving the countryside sparkling and serene 106.png / printed upon the otherwise spotless board floor, 107.png / but he belted them into baggy folds. The other appeared 108.png / sorry that he had obeyed the fleeting impulse to enter. 111.png / going to die two or three times the year, and bother 112.png / so utterly, so disastrously, so swiftly upon his complacency, 113.png / the whole worthless lot?" Bartamon demurred: 114.png / turn of the wrist, skilfully avoiding the high underbrush, 115.png / upon a curtain of old blue velvet. He cast once 116.png / ers, from the farm. As he approached he saw that 118.png / turned Lettice Hollidew stood with her tiresome 119.png / fleeted in the warmer tones of his replies; a new 123.png / girl until--until Buckley.,. until to-night, now. 124.png / luctant eagerness. He kissed her again and again, 126.png / ment, and opened them with an effort. The whippoorwills 129.png / of the "small occupations," the minister's reputation 130.png / less suns of the August valleys. He was as seasoned, 131.png / in a rapid sing-song by the circuit rider, Gordon saw 134.png / the sky. He recognized the sharply-cut silhouette 136.png / was smoothly rounded, provocative; its graceful proportion 137.png / across her face, and she turned and disappeared. 139.png / passed, Gordon gathered the impression of a dark 140.png / he might have had it all. He gazed cautiously, but 141.png / garden patch beyond, Mrs. Caley said. Gordon 142.png / ing that it must be a messenger from the village, dispatched 145.png / soft contours of her virginal breasts, the bracelets 146.png / [Blank Page] 148.png / [Blank Page] 150.png / as I've a mind to?" Gordon demanded belligerently. 151.png / and white, with an occasional red thread drawn 152.png / recorded, the elbow crooked. "Don't forget his 153.png / the church all regular and highly fashionable. He 154.png / examined the details of your late father-in-law's 155.png / came grave at the contemplation of the amount involved. 157.png / legs were ludicrously, inappropriately, long and 158.png / precariously rewarded labor with the stubborn, inimical 162.png / he's as gritty as--why, yes, I do, I'll call him General 163.png / center occupied by a large silver-plated castor, its 166.png / throaty voice, "I'm afraid.... Tell me it will be 167.png / the voice, as it were, of a sinister, tin manikin galvanized 168.png / you'd like to hear General Jackson sing; he's got 170.png / had precipitated this rebellion, this strife in which 172.png / and playing him out. Come here, General JacK-son." 174.png / ing General Jackson at his heels, he picked the dog 179.png / stage-driving days, of the younger years. Her manner 185.png / were close-lipped, somber. The men were sparely 186.png / gathered in the noisome shadows, bottles were 187.png / aside, and a woman walked stiffly out, her hands 188.png / flood. "At last it's about your hearts, your hearts 189.png / duced a small jug. He wiped the mouth on his 190.png / in search of Meta Beggs; perhaps, after all, she had 195.png / for it. Almost everybody wants the same thing--plenty 196.png / hind, as the former made his way toward them. 198.png / surface of blood and hair and dirt. Buckley's eyelids 200.png / off me. I was a year and a half there, when--when 201.png / shafts of the trees on the lawn. Supper was in progress 202.png / laid it on the indistinct bed, and moved to the mirror 203.png / by the aid of a hand lantern. He was reluctant to 204.png / women since the dawning of consciousness, that it 206.png / slightest opening; and he continued uncomfortably 207.png / eluded him. "Please," she protested coolly, "don't 208.png / 'but at night--satin gowns with trains and bare 213.png / surprising fore-knowledge of the County, who had 214.png / atory position. He would extract the last penny of 217.png / always seem to be around, to get talked about, when 219.png / existing conditions. "Your wife's estate controls 222.png / with a scant black ribband about her waist, her sole 224.png / rested at Lettice's hand, and, before Gordon, a portentous 227.png / nolia flowers, would never thicken and grow rough. 228.png / rose in Berry's pallid countenance, Sim's portion 229.png / shaggy horses as they lay clumsily down to rest, on 230.png / a long drink. He drank mechanically, without any 231.png / untrimmed. A chair by the bed bore Lettice's 233.png / ing her elbow, shook her. She was as rigid, as unyielding, 237.png / the pleasantest body you'd meet in a day on a horse. 238.png / accomplished fact; Lattice's wishes, her quality of 240.png / bag, he had lamed a horse--a satisfactory driver 241.png / ing, in the sooty shadows. With the necklace of 242.png / night. It was late afternoon of the day on which he 245.png / lace. Finally he found her; or, rather, she slipped 246.png / you. Some people even like it. A man who came 248.png / deepened to its darkest hour; the moon, in obedience 251.png / sible act of cowardice--Lattice, a girl, blinded by 253.png / that gaping, insatiable chasm. He was conscious 254.png / lighter sky. The foliage of the maples, stripped of 255.png / was driving, and by her side ... Lettice! Lettice 257.png / casual subterfuges. The evasion which he summoned 258.png / plicity, the weakness, the sensual and egotistical desires, 259.png / her youth haggard with apprehension and pain, the 262.png / it was worse. The buggy, badly hitched, bumped 263.png / silence of months, dispelling the accumulating ill-will 265.png / cfeeded from ... it wasn't as though he had gone 267.png / -what man had not?--but this was different; this 268.png / which totally misrepresented him.... He would 270.png / men; it tampered ferociously with the beauty, the 272.png / horse's hoofs on the road above; the sun moved 273.png / esty on the bed; "there was a good bit I didn't get 274.png / over the uneven boards of the porch. At this hour 276.png / coldness seemed to come through the cover to his fingers. 278.png / a box, the lantern at his feet casting a pale flicker 280.png / [Blank Page] 282.png / scraped of mud, bore long cuts across the heels, 283.png / after him. Then, as he turned, he saw that there 285.png / since he had "called out" Gordon's home; the 287.png / waist had been crisply ironed, her shoes were rubbed 288.png / his way. But, almost immediately, he stopped. 290.png / ing which even the auctioneer grew apathetic. He 292.png / now, suddenly detached from the aimless procession 293.png / in rigid rows on the dresser; the pots were scoured 298.png / the same thing in the Bottom. Ask anybody who 299.png / he pulled his hat over the flaming helmet of hair. 301.png / at either side of the large, uncut stone at the threshold; 303.png / he admitted; "but I haven't got to. It's enough to 306.png / shoulders of men momentarily forgetful or caught in 307.png / the touch of a magic wand. He had never realized 308.png / say, three per cent, grant extensions of time wherever 310.png / a man with a round, freshly-colored countenance, 313.png / that know tell you," Merlier paused at the door, "the 318.png / like that ... delicate--" He knelt, with an expression 325.png / an estate estimated at--" he stopped from sheer 327.png / municate with the Tennessee and Northern Company, 328.png / them to take or leave. But, if they delayed, watch 333.png / hardly more alive than the photographed clay of 334.png / inhabited desolation, in a black chasm filled with 337.png / called, in Greenstream, the Portugee; every crop he 338.png / who paid for and removed the bodies of dead animals. 339.png / controlled the Bugle in addition to countless other 345.png / window, saw that the sweep by the stream was filling 346.png / lie cried out of his bitterness of spirit, "but I'd ruther 348.png / light, blurred, mingled, in his vision. He put out 350.png / the iron-like earth. In the pale circle of the lantern 352.png / him. He thought, in sudden approbation of a part, 353.png / stage; formerly Clare had attended to the house for 355.png / ance, over the obscured way. The stage mounted, 359.png / able,--the glassy road enormously increased the labor 360.png / other's arm sweep up.--The switch fell viciously 364.png / village soon's I can; and here you drag and hang 365.png / ing, dead planet. Gleams of light shot like quicksilver 368.png / chamber of the safe. A flickering desire to see led ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/cfc57aac/attachment-0001.htm From vze3rknp at verizon.net Wed Jul 23 15:33:00 2008 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Wed, 23 Jul 2008 18:33:00 -0400 Subject: [gutvol-d] not a dialog In-Reply-To: <4887AA74.6000601@pobox.com> References: <238672185.93911216840318819.JavaMail.mail@webmail09> <4887AA74.6000601@pobox.com> Message-ID: <4887B19C.3070108@verizon.net> Adding these checks to the proofing interface at DP is something that I've wanted for years now. Wordcheck is the first step in that direction, and I keep hoping that some developer will take on the task of writing a companion tool that will make use of regex's and other useful things for pointing out potential errors. This is such an obvious tool that we'd even thought of it before bowerbird mentioned it in the forums. We don't have it yet simply because none of our volunteer developers has been willing to tackle it. If I could wave a magic wand and make it happen, we'd have had it years ago. JulietS Roger Frank wrote: > Joshua Hutchinson wrote: > > | Just wanted to pop in to ask if you (or anyone else) has > | looked into incorporating these checks into the proofing > | interface at DP? > > That would be a big boost to productivity. The difficulty > for me is that I'm comfortable with Ruby and Perl but > uncomfortable with PHP, and I think that's an important > deficiency for anyone wanting to integrate it at DP. > That's why for me it's a standalone utility, like guiprep, > only written in Ruby--it's just my limitation in being able > to put it inside a wrapper with something stronger than a > textbox widget. If I could find the equivalent of guiguts' > built in editor/presentation manager, only written in Ruby, > I would certainly use it. That would at least make it > interactive in a "proofing round 0" sense. > > So bottom line, for me the answer is that it's only a > "I wish I was smart enough to do that" kind of thing. As > a proofer myself at DP, I agree it would be a big win. > > --Roger Frank > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > No virus found in this incoming message. > Checked by AVG - http://www.avg.com > Version: 8.0.138 / Virus Database: 270.5.5/1568 - Release Date: 7/23/2008 6:55 AM > > > > From rfrank at pobox.com Wed Jul 23 15:38:52 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 16:38:52 -0600 Subject: [gutvol-d] woman in her own right -- 001 In-Reply-To: References: Message-ID: <4887B2FC.2060607@pobox.com> Bowerbird at aol.com wrote: > ... the runheads and pagenumbers > are in the text. what's up with that? d.p. policy is to remove them, > and it's something the computer can do quite easily, so why have the > _proofers_ spend their time and energy doing it instead? that's b.s. Just a point of clarification: yes the page headers have been retained on that (and two other) newcomer's only projects. I have started over a hundred books for newcomers and have provided individual personalized feedback to every P1 proofer on each of those books. That's well over a thousand PMs to the new proofers. What I've learned is that many of the corrections on books that I preprocessed and removed the page headers on was in the P1 not getting the page breaks right, even with the top-of-page code in place. On these last three books, I did the equivalent of leaving the "remove page headers" step out of a traditional guiprep run. Because they are Newcomers Only/Rapid Review projects, I'll know right away if forcing a decision by the P1 regarding to what to do at the top of a page is worthwhile. By the way, when the project was created (see the Project Discussion), I announced "In this project, all page headers have been retained. Follow the standard guidelines to remove them, including adding a blank line if the top of the page starts a new paragraph." Proofers don't have to work on the project if that is onerous to them, nor should they be surprised if they choose to participate. That's what the project thread is for. These three books are an experiment, and since I for one don't already know all the right answers, it is how I discover more about proofers, proofing, and the price of getting better output. --Roger Frank From Bowerbird at aol.com Wed Jul 23 15:58:37 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 18:58:37 EDT Subject: [gutvol-d] woman in her own right -- 001 Message-ID: roger said: > yes the page headers have been retained > on that (and two other) newcomer's only projects in "mountain blood", you left the pagenumbers, even though it would've been quite simple to write code to remove them... in "the crevice", there were hundreds and hundreds of cases where "blaine" was misrecognized as "elaine", which could've been fixed book-wide with a global change. examples abound. > What I've learned is that many of the corrections on books > that I preprocessed and removed the page headers on > was in the P1 not getting the page breaks right, > even with the top-of-page code in place. why don't you "get the page breaks right" in _preprocessing_? and then, when a diff comes up, and you see a proofer made a change to what was _already_ correct, you can inform them? this is one of the biggest values of preprocessing, that when a change is made, it can be an indicator of a misinformed proofer. > These three books are an experiment, and > since I for one don't already know all the right answers, > it is how I discover more about proofers, proofing, > and the price of getting better output. well, i don't know all the right answers either, but i know that making humans do something that the computer could do faster and easier is something that makes my stomach queasy... runheads are such an _obvious_ example of this, it would be irresponsible of me -- when analyzing this experiment -- to fail to mention such a thing. so just because you don't know "all the right answers" doesn't mean you stop using your brain to intuit them. and really, if we can't agree on the _obvious_ things, then there isn't much sense in having a dialog, or even bothering to type posts back and forth to each other... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/6c8a1b32/attachment.htm From Bowerbird at aol.com Wed Jul 23 16:03:40 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 19:03:40 EDT Subject: [gutvol-d] not a dialog Message-ID: dkretz said: > You may remember that I implemented a new proofing interface > a year or two ago, which provided a "preview" mode showing > real italics, etc. That has since added a quote-matching display, > and a punctuation reasonability-checker. They may still be on > the dev server - I haven't checked for a long time. juliet said: > We don't have it yet simply because > none of our volunteer developers > has been willing to tackle it. if somebody can sort this all out, do please explain it to me, ok? thanks. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/ed51fb5d/attachment.htm From jayvdb at gmail.com Wed Jul 23 16:23:12 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Thu, 24 Jul 2008 09:23:12 +1000 Subject: [gutvol-d] is this a dialog? In-Reply-To: References: Message-ID: On Thu, Jul 24, 2008 at 3:49 AM, wrote: > john- > > wow. i certainly wasn't expecting anything like _that_. > what a nice surprise. i'm bowled over. > > i will be happy to go over and take a look at wikisource. > > it would be a pleasure to offer my constructive criticism > to an entity smart enough to actually treasure such input. > > and i'd be honored to help improve your infrastructure... Great; if you have questions, ask at either: http://en.wikisource.org/wiki/Wikisource:Scriptorium http://en.wikisource.org/wiki/User_talk:Jayvdb Don't worry about form or format; if you mess up we'll fix it and let you know for next time, and thank you for trying. > right from the get-go -- with the wiki structure and your > ability to run bot-based error-finding routines -- i'd say > you have some fantastic potential there. really fantastic. > > my apps are written in basic, so my code won't help you, > but i'm skilled at expressing them in pseudo-code, so if > you've got web-programmers to implement my routines, > we'll be able to work together. We have programmers of all flavours. Opportunity cost will mean your ideas will have to be as good as, or better, than the constant stream of user requested enhancements, but it sounds like that wont be a problem. > and if it wasn't clear, my offline tools are cross-plat apps > that are available at zero cost. (i'd guess that people still > are more efficient doing this work offline, but i'm willing > to let some crafty web-programmers prove me wrong...) We havent done much offline processing, however I can see what you are saying. I converted the following work from PG etext to wiki structure and format using a once-off script. Once converted, a bot uploaded it using "pagefromfile.py" http://en.wikisource.org/wiki/A_Short_Biographical_Dictionary_of_English_Literature http://meta.wikimedia.org/wiki/Pagefromfile.py Most of our tools are designed to work on the users machine, interfaced to the _live_ wiki database. i.e. our software is decentralised. Our development is decentralised. The software that runs the Wikisource (and Wikipedia) system _must_ be open source, however we want people to do whatever pleases them. A bot edit is essentially just a human edit, The wiki system doesnt care how the edit is made. We have _social_ rules around bots, some of them unwritten, because they can make a mess very quickly and it will take a long time to fix the mess. Essentially, the rules boil down to: dont make a mess. We have frameworks in python, perl, php, etc. so if there are existing error checking/fixing programs or routines already available, _anyone_ can integrate those into bot, automated or requiring human intervention, that proceses up any page that is currently marked as "not proofread". I suggest that the bot should run without making human intervention, because humans will be proofreading the text anyway. If the bot is consistently making good changes, it will go through an approval process which brings two benefits: 1) it is then approved to run at full speed, and 2) it is hidden from the "Recent changes" view humans watch to review ongoing changes. If a bot finds errors that it cant fix, it could report the details of those on the "talk" page that accompanys every transcription page. If the bot is regularly bringing pages up to "proofread" quality, it could even be approved to mark a page as such if the bot can find no further issues with the page. Those pages still need to be verified by a human, so any error can still be identified and fixed. A side benefit of this is that pages which the bot doesnt mark as "proofread" will be _more carefully_ inspected by a human, because in the back of everyones mind will be: there is something about this page that the bot didnt like. > i will respond here when i've taken a look at wikisource, > just to show the kind of interaction p.g. could have had, > but if we continue on for long, we can take it elsewhere... Looking forward to some fresh ideas and criticism. -- John From Bowerbird at aol.com Wed Jul 23 16:40:58 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 19:40:58 EDT Subject: [gutvol-d] is this a dialog? Message-ID: john said: > Don't worry about form or format; if you mess up we'll fix it > and let you know for next time, and thank you for trying. i like that. > We have programmers of all flavours.? good to hear. > Opportunity cost will mean your ideas will have to be as good as, > or better, than the constant stream of user requested enhancements wouldn't have it any other way. > We havent done much offline processing, > however I can see what you are saying. ok. > Most of our tools are designed to work on the users machine, > interfaced to the _live_ wiki database. that sounds like a good approach, best of both worlds. i'm accustomed to thinking along the lines of what p.g. does, where a book is done offline and then uploaded to the project. but banana cream downloads scans if they're not on your machine, so it shows the start of an effort to bridge the offline/online chasm. > i.e. our software is decentralised.? Our development is decentralised. ok, good. > If the bot is regularly bringing pages up to "proofread" quality, > it could even be approved to mark a page as such if the bot > can find no further issues with the page.? right. that's the nature what i meant when i talked about "respect" that is due to the human volunteer proofer. don't give them a page which is still marred by deficiencies that even the machine can find. that's the kind of thing that should be done in _preprocessing_... > Those pages still need to be verified by a human, > so any error can still be identified and fixed. excellent. > A side benefit of this is that pages which the bot doesnt mark > as "proofread" will be _more carefully_ inspected by a human, > because in the back of everyones mind will be: there is something > about this page that the bot didnt like. better if the bot can say exactly what it is that it didn't like... > Looking forward to some fresh ideas and criticism. sounds like your head is on straight. that's very refreshing. i've carved out time tomorrow to take a good look at it... :+) -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/5c9200aa/attachment-0001.htm From dakretz at gmail.com Wed Jul 23 16:59:25 2008 From: dakretz at gmail.com (don kretz) Date: Wed, 23 Jul 2008 16:59:25 -0700 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 28 In-Reply-To: References: Message-ID: <627d59b80807231659n75615f6r4469f55d8e5460d4@mail.gmail.com> rfrank, Bird, here are the EB regexes(in the re.vim file). Have at 'em! :) Note that, in some cases, I've been tracking how many changes they incurred. That's dependent on the order in which they were invoked, however, since they overlap. Also, several can be (should be) used recursively. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/8b758c24/attachment.htm From rfrank at pobox.com Wed Jul 23 17:39:33 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 18:39:33 -0600 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 28 In-Reply-To: <627d59b80807231659n75615f6r4469f55d8e5460d4@mail.gmail.com> References: <627d59b80807231659n75615f6r4469f55d8e5460d4@mail.gmail.com> Message-ID: <4887CF45.4040807@pobox.com> don kretz wrote: > rfrank, Bird, here are the EB regexes > (in the re.vim file). > Have at 'em! :) Got 'em. Looks like really good work. Thanks! --Roger Frank From rfrank at pobox.com Wed Jul 23 18:15:12 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 19:15:12 -0600 Subject: [gutvol-d] BB | /dev/null Message-ID: <4887D7A0.9030405@pobox.com> Bowerbird wrote: | and really, if we can't agree on the _obvious_ things, | then there isn't much sense in having a dialog, or even | bothering to type posts back and forth to each other... Perfect! This gutvol-d forum will be better for it. Sometimes it's difficult not to respond to some of your posts, which I'm beginning to think is intentional. To help me not be tempted, I'll just put you in my kill file and suggest you do the same for me. I do what I do for fun. Being insulted, being called stupid, having the effort of hundreds of hours of coding be denigrated isn't fun. I want to live with intention, and too much time spent on gutvol-d is not on my chosen path. I want to continue to learn, and lucky for me there are many places and many people outside this list that can provide that opportunity. I want to appreciate my friends, and over the years I have made many friends at DP where relationships are based on mutual respect and the bonds that grow when experiences--real work, projects, code development and such--are shared. And mostly I want to do what I love, and so I'll go and do that. --Roger Frank From Bowerbird at aol.com Wed Jul 23 19:16:09 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 22:16:09 EDT Subject: [gutvol-d] BB | /dev/null Message-ID: roger said: > To help me not be tempted, I'll just put you in my kill file > and suggest you do the same for me. oh no, not so fast, pilgrim. i don't just put people in my kill file. they have to _earn_ that distinction! you've said a couple stupid things, and dipped into the ad hominem, but you haven't quite earned it yet. :+) > I do what I do for fun. hey, me too. :+) > Being insulted, being called stupid, > having the effort of hundreds of hours of coding be denigrated > isn't fun. i've been the victim of insults here countless times. and i agree with you, it isn't fun... but i'm tough... i've also had my "hundreds of hours of coding" be denigrated. some even "wonder if it exists", believe it or not. that strikes me as humorous, but i still can't stretch to call it "fun". so i agree. but again, i'm tough, so it doesn't bother me... oh, and i never "called you stupid"... that would be ad hominem, and i stay away from that type of thing. what i said was that "what you just said was stupid". do you get the difference? one way is saying that the _person_ is stupid, the other way is saying that the _statement_ is stupid. statements can be stupid. they can be really stupid. and i suppose if you made _enough_ stupid statements, i would feel confident saying that _you_ were stupid. but in the meantime, it's enough to label the _statements_ as being stupid. because in a search for _the_truth_, it is _imperative_ that you call out stupid statements as _being_ stupid, or the fact that they hang around gives 'em credence. if you want to discuss whether it was or was not stupid, that can be done. i stand by my claim that it was stupid. i didn't make it "personal". if you take it that way, it's you exhibiting a behavior that you have _chosen_ to exhibit... > I want to live with intention, and too much time > spent on gutvol-d is not on my chosen path. that's cool. my intention is to make the lobby of the project gutenberg library a lively place to be. i spend as much time as necessary to do that. > I want to continue to learn, and lucky for me > there are many places and many people > outside this list that can provide that opportunity. that's cool. me, i have the opportunity to hang with other performance poets; it's our job to comfort the afflicted, and to afflict the comfortable... > I want to appreciate my friends, and over the years > I have made many friends at DP that's cool. i'd say you fit right in with that crowd. mind meld. > where relationships are based on mutual respect and > the bonds that grow when experiences--real work, > projects, code development and such--are shared. sounds a bit like disneyland, the _happiest_ place on earth. > And mostly I want to do what I love, > and so I'll go and do that. that's cool. i'll stay here and continue to do what i love. :+) *** and there you have it, folks. the dust has settled. so we can go back to what we were doing before, which is to clearly document the inefficiency which is a direct result of the terrible workflow over at d.p. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/78c720d6/attachment.htm From dakretz at gmail.com Wed Jul 23 21:43:38 2008 From: dakretz at gmail.com (don kretz) Date: Wed, 23 Jul 2008 21:43:38 -0700 Subject: [gutvol-d] wikisource Message-ID: <627d59b80807232143q7f3d936aj71f5aa8e34d7baab@mail.gmail.com> John Vandenberg, Have you ever loaded any of the Encyclopedia Britannica projects into wikimedia? Does it seem like a fit to you? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/65de2333/attachment.htm From jayvdb at gmail.com Wed Jul 23 22:23:55 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Thu, 24 Jul 2008 15:23:55 +1000 Subject: [gutvol-d] wikisource In-Reply-To: <627d59b80807232143q7f3d936aj71f5aa8e34d7baab@mail.gmail.com> References: <627d59b80807232143q7f3d936aj71f5aa8e34d7baab@mail.gmail.com> Message-ID: On Thu, Jul 24, 2008 at 2:43 PM, don kretz wrote: > John Vandenberg, > > Have you ever loaded any of the Encyclopedia Britannica projects into > wikimedia? Does it seem like a fit to you? It is a very good fit, but hasnt been driven by anyone due to priorities. Wikisource has a complete set of Catholic Encyclopedia 1913, uploaded by a dedicated soul by scraping the content from newadvent.org (iirc), converting it to wiki syntax, and pushing it into the database by a bot. We have three people who have the complete work at home to check any anonymous improvements that are made, and one person who has been slowly going through and actively improving the pages. A slow and lonely job. Recently oce.catholic.com was launched with a complete set of pagescans, which has helped us distribute this task a little, however sadly that site is claiming copyright and has added an atrocious watermark to the images. We are not yet bold enough import their images. Going back to Eb1911, almost all of EB1911 is in Wikipedia, however their aim was incorporate it as a basis for new articles. http://en.wikipedia.org/wiki/WP:EB1911 To find the EB1911 text that was imported into Wikipedia, you need to squirrel down to the bottom of the history of a Wikipedia page, or close to it. So far I have found that the text is _better_ than jrank, but nowhere near as good as the etexts that PD/PG has produced. I would like to find the Wikipedians who created these pages in Wikipedia; maybe they still have the raw text which was imported. There has been a recent discussion about EB1911 here (it is intermingled throughout the wild discussion; scan for the tables): http://en.wikipedia.org/wiki/Wikipedia_talk:Plagiarism (where I argue that it is improper for Wikipedia to be so vague about what Wikipedia text comes from the PD; clear attribution and accessibility of the original is the solution) On Wikisource, we have slowly been building a verifiable reconstruction of EB1911: http://en.wikisource.org/wiki/EB1911 and the "project page" for that effort is at http://en.wikisource.org/wiki/WS:EB1911 We have a complete set of scans in TIFF and PNG at: http://en.wikisource.org/wiki/User:Tim_Starling Enjoy, John From hart at pglaf.org Thu Jul 24 07:05:53 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 24 Jul 2008 07:05:53 -0700 (PDT) Subject: [gutvol-d] language counts 2008-07-24 (fwd) Message-ID: >From our automated count. I presume there are hundreds more, as when we hit 25,000. . . . Michael Grand total for today: 26000 22197 English en 1204 French fr 539 German de 451 Finnish fi 344 Dutch nl 319 Chinese zh 246 Portuguese pt 194 Spanish es 150 Italian it 56 Latin la 54 Tagalog tl 50 Esperanto eo 40 Swedish sv 20 Danish da 20 Catalan ca 10 Welsh cy 10 Norwegian no 7 Russian ru 7 Icelandic is 7 Hungarian hu 6 Middle English enm 6 Greek el 6 Bulgarian bg 4 Serbian sr 4 Polish pl 4 Hebrew he 4 Friulano fur 3 Old English ang 3 Nahuatl nah 3 Japanese ja 3 Iloko ilo 3 Czech cs 3 Afrikaans af 2 Mayan Languages myn 1 Yiddish yi 1 Slovak sk 1 Sanskrit sa 1 Romanian ro 1 North American Indian nai 1 Napoletano-Calabrese nap 1 Maori mi 1 Lithuanian lt 1 Korean ko 1 Khasi kha 1 Iroquoian iro 1 Irish ga 1 Interlingua ia 1 Gascon gsc 1 Gamilaraay kld 1 Galician gl 1 Frisian fy 1 Cebuano ceb 1 Breton br 1 Arapaho arp 1 Aleut ale From hart at pglaf.org Thu Jul 24 10:21:33 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 24 Jul 2008 10:21:33 -0700 (PDT) Subject: [gutvol-d] eBook Milestones Message-ID: On July 24, 2008, the original Project Gutenberg eLibrary reached 26,000 titles, which should be considered with an assortment of other titles; 1653 from PG of Australia and 509 from PG of Europe, as well as 138 from our latest PG, Project Gutenberg of Canada, and 377 in PrePrints. In addition, there were several dozen titles our programs have not seemed to manage to count as we have posted book numbers up to 26,119, with only 30-40 reserved numbers. Thus, the possible grand totals could be as high as: 26,089 from original Project Gutenberg [US copyright] 1,653 from Project Gutenberg of Australia 508 from Project Gutenberg of Europe 377 from Project Gutenberg PrePrints ====== -30 28,622 Grand total [presuming 30 numbers reserved] On this same date, The world eBook Fair made it to totals of 1 1/4 million total entries: 500,000+ The World Public Library 468,000+ The Internet Archive 160,000+ eBooksAboutEverything.com 17,000+ International Music Score Library Project 28,000+ Original Project Gutenberg eBooks 75,000+ Project Gutenberg Consortia Center ========== 1,250,000+ Grand Total From hart at pglaf.org Thu Jul 24 10:24:29 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 24 Jul 2008 10:24:29 -0700 (PDT) Subject: [gutvol-d] FIXED: eBooks Milestones Message-ID: A few little things adjusted. Please let me know of more suggestions, comments, etc. On July 24, 2008, the original Project Gutenberg eLibrary reached 26,000 titles, which should be considered with an assortment of other titles; 1653 from PG of Australia and 509 from PG of Europe, as well as 138 from our latest PG, Project Gutenberg of Canada, and 377 in PrePrints. In addition, there were several dozen titles our programs have not seemed to manage to count as we have posted book numbers up to 26,119, with only 30-40 reserved numbers. Thus, the possible grand totals could be as high as: 26,084 from original Project Gutenberg [US copyright] 1,653 from Project Gutenberg of Australia 508 from Project Gutenberg of Europe 377 from Project Gutenberg PrePrints ====== -35 28,622 Grand total [presuming 35 numbers reserved] On this same date, The world eBook Fair made it to totals of 1 1/4 million total entries: 500,000+ The World Public Library 468,000+ The Internet Archive 160,000+ eBooksAboutEverything.com 17,000+ International Music Score Library Project 28,000+ Original Project Gutenberg eBooks 75,000+ Project Gutenberg Consortia Center ========== 1,250,000+ Grand Total From Morasch at aol.com Thu Jul 24 12:43:24 2008 From: Morasch at aol.com (Morasch at aol.com) Date: Thu, 24 Jul 2008 15:43:24 EDT Subject: [gutvol-d] woman in her own right -- 002 Message-ID: here's some more data from "woman in her own right"... appended is a list of 31 more questionable lines, raising our total from our first 2 passes to 86 hits. -bowerbird > ~"TELL ME ALL ABOUT YOURSELF," HE SAID. .Frontispiece > land--open the shutters, Mose, so we can see. . . . > 'traits. . . . There, -sir, is a set of twelve > in a comfortable chair, lit a cigarette. . . . > whom he paid, would miss him. . . . > order--and then tell me what you think of it." . . . > sailing vessel, or a motor boat, obtainable? . . . > what's that you say? . . . Miles Casey?--on Fleet > Street, near the wharf? . . . Thank you!--He > it mild. . . . Betty Whitridge and Nancy Wellesly > of something over seventy-five. . . . That is about > across on the other. . . . Now," as they wound up > any one reads that letter, the jig is up for us. . . . > letter and the money were gone. .... > Lie low. . . . He's not coming this way--he's going > to inspect the big trees, on our left. . . . They won't > lines drawn from them intersect? ~" . . . > side. . . . "Now, sir, what is it?" as the flaps > equal! . . . Now, if you'll be quiet a moment, like > you'll not be averse to hear. . . . So, that's better. > ~. . . Thank you! Now, you may arise and shake > he not? . . . > fallen, by adversity, from better things. . . . > A little of all three, he concluded. . . . But, > -of the stocks and bonds, from the Trust > himself. . . . > ..........What do you make of it? ~" he > be found. .*. . It makes everything seem very real > "Better be a little careful, Bill! "he said. . "I > fell to thinking. . . . Presently, worn out by > "We'll have the full effulgence, if you please." . . . ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080724/7598deeb/attachment.htm From Bowerbird at aol.com Thu Jul 24 12:46:30 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 24 Jul 2008 15:46:30 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 023 Message-ID: 23. search for all lines with a dash followed by a capital letter. a small number of lines (2) _starting_ with a dash were wrong, since that dash was a misrecognized em-dash; they were fixed. > -Clare dangerously ill ... a question of dying, > -Why! what's the matter with you, Makimmon? another line had a word misrecognized, including a dash-capital-i: > face, with its heavy, good features and slow-Idndling and there were two other lines where the dash-capital was correct: > and she sat on the bed with a "G-G-God!" Jake > On an afternoon of mid-August Gordon was 3 lines were corrected, for a grand total of 201, on 23 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080724/15246302/attachment.htm From Bowerbird at aol.com Thu Jul 24 17:12:20 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 24 Jul 2008 20:12:20 EDT Subject: [gutvol-d] getting my wikisource bearings Message-ID: john- well, i went to wikisource, and poked around for a little bit. just some general questions and observations for now... i'm not sure i grok the structure of the place quite yet, but i accidentally managed to get to some proofreading interface: > http://en.wikisource.org/wiki/Page:Robertson_Scottish_Gaelic_Dialects0328.png i used the up-arrow on that page to go here: > http://en.wikisource.org/wiki/Index:Scottish_Gaelic_Dialects but i'm not sure how i would navigate to that page otherwise, and i don't see where i can overview the books being-proofed? and do you track how many times a page has been proofed? i'm not talking just about how many times the page was edited, but also how many times it was proofed and no errors found. my methodology for deciding a page is "done" is when a certain number of proofers (e.g., 2) examined it and found no errors... (choose a higher number for a greater probability of no errors.) i believe it's best if this data is exposed to proofers, so those who seek a bigger challenge can choose the more-well-proofed pages (that might harbor the "elusive error"), while proofers who prefer to do lots of easy corrections will choose the "raw o.c.r." pages... i like the "magnifier" you've got on your images, it's very useful. the images themselves are bandwidth-huge (2.5 megs each!). that's the standard wiki editor, isn't it? -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080724/1166a315/attachment.htm From jayvdb at gmail.com Thu Jul 24 20:31:23 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Fri, 25 Jul 2008 13:31:23 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: On Fri, Jul 25, 2008 at 10:12 AM, wrote: > john- > > well, i went to wikisource, and poked around for a little bit. Great, I've been anxiously watching for a user "Bowerbird" to be created. If you have used a different username, please put me out of my misery and let me know, privately if you would prefer. > just some general questions and observations for now... > > i'm not sure i grok the structure of the place quite yet, but > i accidentally managed to get to some proofreading interface: >> >> http://en.wikisource.org/wiki/Page:Robertson_Scottish_Gaelic_Dialects0328.png Note that the page begins with "Page:" - more on that to follow. That page is proofread, and waiting for validation. It is color coded as yellow. If it is correct, click edit at the top, and click the radio button that is color coded green, down the bottom. Then click "Save page". No need to worry about the edit summary field; it will be automatically filled in. > i used the up-arrow on that page to go here: >> http://en.wikisource.org/wiki/Index:Scottish_Gaelic_Dialects > but i'm not sure how i would navigate to that page otherwise, > and i don't see where i can overview the books being-proofed? It has only recently been added, and has so far been only worked on by a single user. That user hasnt added it to the list of transcription projects; I have corrected that now: http://en.wikisource.org/wiki/Wikisource:Transcription_Projects#Projects_needing_to_be_proofread All English transcription projects are automatically categorised into this dynamic page: http://en.wikisource.org/wiki/Category:Index All pages that are yet to be proofread can be found in another dynamic category: http://en.wikisource.org/wiki/Category:Not_proofread > and do you track how many times a page has been proofed? > i'm not talking just about how many times the page was edited, > but also how many times it was proofed and no errors found. We have a two stage process. The page is marked as "proofread" (yellow) when one person has proofed it, and we expect that they have found and fixed all transcription issues. It isnt necessary that layout issues have been resolved yet. A second person comes along and validates (green) that it is indeed an accurate transcription. At the moment, these are tracked by categories: http://en.wikisource.org/wiki/Category:Proofread http://en.wikisource.org/wiki/Category:Validated At the same time, the work may be published. View this page, and click edit. http://en.wikisource.org/wiki/Scottish_Gaelic_Dialects Notice that it only contains references to the transcribed pages, and that it does _not_ contain any prefix (i.e. no "Page:"). This page is a logical layer that has been created on top of the transcription pages. We measure the size of our wiki in number of published pages of this kind. We add markup into the transcription pages, often with much complexity. For example, this set of pagescans... http://en.wikisource.org/wiki/Index:H.R._Rep._No._94-1476 ... is presented as a single authentic page ... http://en.wikisource.org/wiki/Copyright_Law_Revision_(House_Report_No._94-1476) ... and then as an annotated edition, with corrections which can be found at the bottom. http://en.wikisource.org/wiki/Copyright_Law_Revision_(House_Report_No._94-1476)/Annotated > my methodology for deciding a page is "done" is when a certain > number of proofers (e.g., 2) examined it and found no errors... > (choose a higher number for a greater probability of no errors.) our methodology for deciding when a page is done is far more fluid. It stops being a transcription project once the pages are all green. We hope that the proofreaders have worked together, documented their choices, and the result is consistent. For example here is our "proofreading project of the month" (I believe the text came from Project Gutenberg, so this is primarily an small projecct to re-unite the text with a set of pagescans of an identifiable edition). http://en.wikisource.org/wiki/Index:Wind_in_the_Willows_(1913).djvu On the accompanying talk page are notes: http://en.wikisource.org/wiki/Index_talk:Wind_in_the_Willows_(1913).djvu Once we are finished that project, we will then reconstruct an earlier edition (I had hoped we would do this earlier edition first, but ... we've got nothing but time..). http://en.wikisource.org/wiki/Index:Wind_in_the_Willows.djvu > i believe it's best if this data is exposed to proofers, so those who > seek a bigger challenge can choose the more-well-proofed pages > (that might harbor the "elusive error"), while proofers who prefer > to do lots of easy corrections will choose the "raw o.c.r." pages... We believe it is best if the images and associated data are exposed to the _reader_. Every reader is a potential contributor. It is readers who find the most difficult transcription errors, simply because there are more of them, and they are the ones that are shocked. A proofreader is looking for problems, and can easily miss them for lookn at them. A reader isn't expecting problems, and so is rudely awakened from their enjoyable reading when they see an error. Once a transcription project has been completed, the text is still editable so anyone can "value add". For fiction, we discourage semantic markup as it is distracting, however for other types of works, e.g. scientific, biographical, etc., transcription is just the beginning. The process is also not strict. Here is a work that has been semantically marked up prior to being proofread, because the person doing the work is actually more interested in creating Wikipedia biographies for the people named in it, and extracting the images therein. The transcription project is basicly just a way for him to keep track of where he is up to: http://en.wikisource.org/wiki/Index:A_Concise_History_of_the_U.S._Air_Force.djvu I dont recall finding any errors in that transcription as yet, so the red pages should probably have been marked yellow on creation. Not that it matters, because two people verifying it doesnt hurt, and it is interesting to boot. As an example of the fun part of Wikisource proofreading, notice on pagescan 7, there is a link underneath "wreck of bloodstained wood, wire, and canvas". Clicking on that takes the reader to the NYT article the quote comes from, complete with pagescans. http://en.wikisource.org/wiki/Page:A_Concise_History_of_the_U.S._Air_Force.djvu/7 On page 8, we are trying to get our hands something tangible for the quote "Why all this fuss about airplanes for the Army? I thought we already had one." and have located the pagescans for the mentioned appropriations act of March 3, 1911. http://en.wikisource.org/wiki/Page_talk:A_Concise_History_of_the_U.S._Air_Force.djvu/8 Just now I have added this to our todo list. > i like the "magnifier" you've got on your images, it's very useful. > > the images themselves are bandwidth-huge (2.5 megs each!). Our images vary in size, depending on where they came from. We dont have any rules on quality. One of our recent "featured texts" started from a poor res gif, which I uploaded because a website was given a take down order for a collection of PD obituaries. http://en.wikisource.org/wiki/Image:Charles_Babbage_(Obituary%2C_The_Times).gif This text was not transcribed on a "Page:" , as the distinction between physical pages and the logical overlay was not in common use at the time. Here is the published page: http://en.wikisource.org/wiki/The_Times/The_Late_Mr._Charles_Babbage%2C_F.R.S. Others like the idea of featuring this interesting obit, so they hit the stacks and scanned a much higher res image: http://en.wikisource.org/wiki/Image:Obituary_for_Charles_Babbage.png While it was featured, which means it was on the front page, the pages are protected so that only site administrators may edit it. This is mostly to prevent vandalism. As it turned out, the transcription was not 100% correct .. and an anonymous person fixed the error after we removed the protection: http://en.wikisource.org/w/index.php?title=The_Times%2FThe_Late_Mr._Charles_Babbage%2C_F.R.S.&diff=619858&oldid=617551 Any contributor can upload new images, djvu files, ogg files, or even pdf files. If a pdf or a djvu file turns up without an OCR layer, high res. images are very useful to allow someone else to grab them and push them through OCR, often kicking and screaming in the case of unconventional scripts and layouts. This DjVu file, with embedded OCR, can then be uploaded over the top of the original file (the old file is still accessible), and then our bots automatically extract the text layer from the djvu file and create the pages to be proofread. > that's the standard wiki editor, isn't it? It has been developed for Wikisource. -- John From Bowerbird at aol.com Thu Jul 24 23:35:19 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 25 Jul 2008 02:35:19 EDT Subject: [gutvol-d] getting my wikisource bearings Message-ID: john said: > I've been anxiously watching for a user "Bowerbird" to be created. i just did, so you could stop being anxious... :+) > All English transcription projects are automatically > categorised into this dynamic page: > http://en.wikisource.org/wiki/Category:Index ok, great, thank you. > All pages that are yet to be proofread can be found > in another dynamic category: > http://en.wikisource.org/wiki/Category:Not_proofread got it... > We have a two stage process.? > The page is marked as "proofread" (yellow) > when one person has proofed it, and we expect > that they have found and fixed all transcription issues. yes, i read up on that since. thanks for all the other information you provided. if i have any more questions, i will let you know... i expect it will take me several weeks to get up to speed with enough knowledge of your system to ask questions that might be a reflection on the quality of your workflow, but if anything's up in the interim, i'll let you know that too. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080725/7e013d29/attachment-0001.htm From ralf at ark.in-berlin.de Fri Jul 25 01:46:06 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Fri, 25 Jul 2008 10:46:06 +0200 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: <20080725084606.GB1589@ark.in-berlin.de> You wrote > ... It is readers > who find the most difficult transcription errors, simply because there > are more of them, and they are the ones that are shocked. A > proofreader is looking for problems, and can easily miss them for > lookn at them. A reader isn't expecting problems, and so is rudely > awakened from their enjoyable reading when they see an error. The only time I've seen such cynicism was with companies releasing beta software to minimize the costs of testing. The fundamental difference, however, between proofing transcriptions and software testing is that software is never finished. Of course, the two-step proofing of Wiksource will leave lots of errors for its two-stepness but also because of the clumsiness of the proofing interface when compared to PGDP. Result is significantly lower quality than DP-released books. Sorely missing are also print and other format versions of the texts. Regards, ralf From jayvdb at gmail.com Fri Jul 25 04:43:15 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Fri, 25 Jul 2008 21:43:15 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <20080725084606.GB1589@ark.in-berlin.de> References: <20080725084606.GB1589@ark.in-berlin.de> Message-ID: On Fri, Jul 25, 2008 at 6:46 PM, Ralf Stephan wrote: > You wrote >> ... It is readers >> who find the most difficult transcription errors, simply because there >> are more of them, and they are the ones that are shocked. A >> proofreader is looking for problems, and can easily miss them for >> lookn at them. A reader isn't expecting problems, and so is rudely >> awakened from their enjoyable reading when they see an error. > > The only time I've seen such cynicism was with companies releasing > beta software to minimize the costs of testing. The fundamental > difference, however, between proofing transcriptions and software > testing is that software is never finished. It isnt cynicism to be pragmatic. I am talking about opportunity cost and diminishing returns. Many early Project Gutenberg etexts are riddled with errors, have paragraphs missing, dont include images that are integral to the work, etc, etc. Thank goodness PG put them out anyway, and either readers or other reviewers identified problems and provided corrections. There are many errors in those etexts to this day, but they are still useful. English Wikisource works on the same premise as PG, open source, Wikipedia, etc. Near enough is usually good enough, and if someone really wants better, they'll have the motivation and/or funding to fix it. This is often inappropriate, where high availability is required, or the client of the output is paying big bucks. Then you can put a good QA team onto the task, and find every problem, in order to decide whether it needs to be fixed. When proofreaders are volunteers, it is best to let them go only as far as they are motivated to go. This is at odds with the running to the "finish" line mentality, which forces many sets of eyes to review a work in order to be 100% sure it is finished. The problem with this approach is that those reviewers arn't having fun. They are looking for errors, but if they have found none in the last 10 pages, in the back of their mind they are pretty sure there are none - or, they know that it is going to be damn hard to find it, so they are not happy to be looking for the needle in the haystack. The English Wikisource model differs slightly from the "finish line" approach because the text is immediately tossed into the public eye on the web, with quality indicators to alert the reader to the status of the text. Being a wiki, it is hard wired into the system that we expect gradual improvements to occur. As a result, the English Wikisource is not pushing texts across the finish line quite so hard, and readers are pointing out the odd error they find. This is pragmatic because we are small, and pragmatic because readers are not being told it is finished - the reader is hopefully fully aware that a wiki _can_ be full of rubbish - Wikipedia should have taught everyone that by now. Our approach in this may change in time, and the German Wikisource project is run much more like PGDP, so maybe there is hope yet for the English project. (as an aside, I only know the English and Latin Wikisource projects well; I know the practises of the French or German Wikisource projects to be quite different, and I am sure that other smaller language projects have their own policies and models. ) Personally, I like the instant gratification of putting online a 99% accurate transcription. I also like to verify the work of others, but I cant stomach much of it - the eyes blur over and I start thinking about the weekend. My hat is off to those that have been doing this painstaking work. > Of course, the two-step proofing of Wiksource will leave lots of > errors for its two-stepness but also because of the clumsiness of > the proofing interface when compared to PGDP. Result is significantly > lower quality than DP-released books. Sorely missing are also > print and other format versions of the texts. Please explain where our interface is clumsy in comparison to PGDP. It might just be that I havent explained something, or we have come at the problem from a different angle. Also if you consider the two-step process to be inadequate, I'd like to hear more. We can add as many steps as are necessary. The workflow diagram is currently very simple, but we have found it effective so far. A "pre-processed" stage could be a good addition, but it doesnt really say a lot, as further pre-processing might be possible. Automations can mark a page as "problematic" if they can detect any errors that cant be fixed. We do publish immediately to web, we prefer to present works as a single page if possible so print is easier, and we havent tackled other formats, yet. At present we are focusing on the transcription interface. On English Wikisource, we havent "released" very many works, so it is a bit to early to do a comparison on that front. Here is our list of "featured texts", which are increasing in quality over time, a process which has caused us to not proceed with the monthly cycle at times due to priorities. http://en.wikisource.org/wiki/Wikisource:Featured_texts -- John From ralf at ark.in-berlin.de Fri Jul 25 08:45:05 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Fri, 25 Jul 2008 17:45:05 +0200 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: <20080725084606.GB1589@ark.in-berlin.de> Message-ID: <20080725154505.GA2340@ark.in-berlin.de> You wrote > It isnt cynicism to be pragmatic. I am talking about opportunity cost > and diminishing returns. Many early Project Gutenberg etexts are > riddled with errors, have paragraphs missing, dont include images that > are integral to the work, etc, etc. Thank goodness PG put them out > anyway, Thank goodness? It's those texts that frequently draw criticism with reviews on manybooks, for example. Also, since there are so many important works without transcription, why don't you concentrate on volume and just publish the OCR like UMich with their scans accompanied by the OCR? > Our approach in this may change in time, and the German Wikisource > project is run much more like PGDP, so maybe there is hope yet for the > English project. I don't think so when I look at the results. > Please explain where our interface is clumsy in comparison to PGDP. With DP, you have both the scan and the text on screen. The spellcheck needs one click and presents errors as marked and ready for correction; good/bad word lists per language exist. Different fonts are available at once. No risk of correction conflict. Listing of your diffs *per project* possible. Browsing of scans with one click (prev/next). Possibility of chosing difficulty of work (easy/normal/hard projects as well as P123/F12). Support of TEI master producing all other formats (TXT, HTML, PDF) automatically. Possibility of LaTeX-only projects. > Also if you consider the two-step > process to be inadequate, I'd like to hear more. With P123/F12/PP/PPV, DP has a seven-step process. Although later rounds aren't required to do proofing, glaring errors are certainly corrected in later rounds, too. > We can add as many steps as are necessary. But you won't. > We do publish immediately to web But what people call now 'the usual Gutenberg quality' will be achieved only much later, when eBook producers will have forgotten about the link. At that time, say ten years, you'll still pounce on the old horse of lower quality of <10k etexts. We'll be at 50k(?), then. ralf From hart at pglaf.org Fri Jul 25 08:46:23 2008 From: hart at pglaf.org (Michael Hart) Date: Fri, 25 Jul 2008 08:46:23 -0700 (PDT) Subject: [gutvol-d] Who Needs a Library When You Have an iPhone Message-ID: http://ct.bnet.com/clicks?t=70496868-2e256fe64ecacf4b8cba0b6fdb65369a-bf&brand=BNET&s=5">Who Needs a Library When You Have an iPhone? BNET's Rick Broida on how to turn your iPhone into the ultimate e-book reader. From walter.van.holst at xs4all.nl Fri Jul 25 08:50:49 2008 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Fri, 25 Jul 2008 17:50:49 +0200 Subject: [gutvol-d] Who Needs a Library When You Have an iPhone In-Reply-To: References: Message-ID: <4889F659.2040500@xs4all.nl> Michael Hart wrote: > http://ct.bnet.com/clicks?t=70496868-2e256fe64ecacf4b8cba0b6fdb65369a-bf&brand=BNET&s=5">Who > Needs a Library When You Have an iPhone? BNET's > > > Rick Broida on how to turn your iPhone into the ultimate e-book > reader. Having both an iRex iLiad and an iPhone, I prefer the iLiad or my e-book purposes. To each his own, I guess. Regards, Walter (and yes, I am a bit of a gadget-addict) From Bowerbird at aol.com Fri Jul 25 10:33:26 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 25 Jul 2008 13:33:26 EDT Subject: [gutvol-d] Who Needs a Library When You Have an iPhone Message-ID: walter said: > Having both an iRex iLiad and an iPhone, > I prefer the iLiad or my e-book purposes. > To each his own, I guess. do you carry your iphone all the time? and do you carry your iliad all the time? which do you prefer for "e-book purposes" if/when you are only carrying your iphone? finally, which do you prefer for "phone purposes"? :+) there's a good reason generalized devices usually win in the marketplace long-term... don't get me wrong. i'm sure the huge screen on the iliad is _a_sheer_pleasure_ to read from. but unless you carry a purse, it's also a bother... (not to mention what the darn thing _costs_!) -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080725/f0f9b5f4/attachment.htm From Bowerbird at aol.com Fri Jul 25 10:56:02 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 25 Jul 2008 13:56:02 EDT Subject: [gutvol-d] getting my wikisource bearings Message-ID: john- please ignore the ralf-baiting. the d.p. echo-chamber has convinced itself that its product is quite superior... *** i said: > if anything's up in the interim, i'll let you know that too. it did just so happen that something jumped out right away. you're rewrapping the text before it goes in front of people -- removing the original linebreaks from the print book -- which makes the text _exceedingly_ more difficult to proof... if you wanna know the one thing you're doing wrong, this is it. i would estimate it cuts proofing efficiency in half, or _more_. moreover, it makes it much less pleasurable to do. lose-lose. for a look at a system that retains the p-book linebreaks, see: > http://z-m-l.com/go/mabie/mabiep123.html > http://z-m-l.com/go/myant/myantp123.html > http://z-m-l.com/go/sgfhb/sgfhbp123.html any improvements i'd suggest after this will pale in comparison to the increase you will get by bringing back original linebreaks. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080725/7f37e63f/attachment.htm From sly at victoria.tc.ca Fri Jul 25 12:12:44 2008 From: sly at victoria.tc.ca (Andrew Sly) Date: Fri, 25 Jul 2008 12:12:44 -0700 (PDT) Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <20080725154505.GA2340@ark.in-berlin.de> References: <20080725084606.GB1589@ark.in-berlin.de> <20080725154505.GA2340@ark.in-berlin.de> Message-ID: _Pace_ Ralph and John. As this is a project gutenberg mailing list, it might be good to remember that Greb Newby and Michael Hart have affirmed many times that PG wishes to encourage anyone and everyone interested in the goals of digitizing and preserving texts, regardless of methods used. (Kind of "the more the merrier" approach.) PGDP and wikisource have come from different backgrounds, and I'm sure there is room for improvement in both. (There always is.) I'm also sure that participants in each could learn something from exploring the other. Right away a difference that I see is that wikisource appears to deal more with shorter texts, ie, single poems, historically important letters, patents, codes of law, etc. Andrew From Bowerbird at aol.com Fri Jul 25 15:41:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 25 Jul 2008 18:41:03 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 024 Message-ID: 24. search for all lines with a comma followed by non-whitespace, except the following cases, which are all accepted to be allowed: > comma-doublequote-whitespace > comma-singlequote-whitespace > comma-emdash > comma-doublequote-emdash > comma-singlequote-emdash this routine returned these 7 cases: > Gordon's lips formed a silent exclamation.,. > They glanced,-each at the other, swiftly; it > girl until--until Buckley.,. until to-night, now. > darkly, Gordon, stood still, Meta Beggs fe.ll be,- > It enraged him that she was so collected; her body,* > to its goal,., Gordon saw now that Mrs. Caley > your wife. Miss Beggs oughtn't.,. she isn't anything 4 of them (#1, #3, #6, and #7) were misrecognized ellipses, #2 and #4 were specks that were misrecognized as a dash, and #5 had a speck that was misrecognized as an asterisk... 7 more lines corrected, for a grand total of 208, on 24 routines... i'll be back tomorrow with the next tip in this series... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080725/c4acd089/attachment.htm From Bowerbird at aol.com Fri Jul 25 16:03:48 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 25 Jul 2008 19:03:48 EDT Subject: [gutvol-d] woman in her own right -- 003 Message-ID: continuing on with the "woman in her own right" book... today we observe that roger frank has learned from us. didn't bother to say "thanks", but we don't really need it. appended we show 44 lines that roger's program flagged, and when we look at _why_ they were flagged, we see that roger has incorporated some of our checks into his app... for instance, he is now flagging lines that start with a space. he's also now flagging lines that start with a period, and also flagged are lines that start with a spacey quotemark... and finally, he's now flagging "paragraphs" that start with a lowercase letter, which are typically broken paragraphs. none of these types of glitches were flagged in "mountain blood", so that means roger has added the checks in recently, obviously as a result of my series. so i'm glad that he was paying attention. evidently, he was able to pick up on the hints i offered via _posts_, even though he went on to "accuse" me of _not_sharing_. funny... *** it's still the wrong approach, though, to _flag_ stuff for proofers. these easily-located errors should be fixed by a preprocessor, on a whole-book-wide basis, as a regular part of the workflow, using a dedicated tool that will help streamline the procedure... inserting characters _in_ the text -- in the form of tilda flags -- which must then be _removed_ by the proofers is kind of stupid. both the insertion and the removal take _energy_ to accomplish, so it's just a make-work scenario. we want to _minimize_ work... it's also the case that the "flag-removal" process _causes_errors_. i forget which book i was looking at -- perhaps "the crevice" -- and the "incorrect corrections" on tilda-flagged spacey-quotes was astronomical. that is, instead of making it a close-quote (which is what it actually was) by attaching the quote to the word before it, the proofers made it an open-quote on the word after. (or vice versa, where they made an open-quote a close-quote.) now, proofers almost _never_ get spacey-quotes wrong like that, not when the spacey-quotes aren't flagged. but these tilda-flags having to be removed made the proofers goof up the job big-time. i've never seen more "incorrect corrections" done on spacey-quotes. indeed, i've never seen so many "incorrect corrections" on anything! i'll be able to present you with some hard numbers on this later, when i cover whatever book it was that manifested this problem; for now, just store a little nugget about a problem with flagging. appended are the 44 lines which _could_ have been preprocessed, bringing our grand total of found-but-not-fixed errors to _132_... instead now the proofers will have to make those corrections... -bowerbird *** lines that start with a space ~ The remaining member of the party was Montecute ~ Many Prominent Persons Among the Creditors. *** lines that start with a period ~.to his lips--and, then, without a word, swung ~.go by water to Baltimore (which was available on ~.PIRATE'S GOLD 151 ~. . . Thank you! Now, you may arise and shake ~.said Croyden. ~.Kneeling, he quickly dug with a small trowel a hole *** lines that start with a quotemark followed by a space ~" she asked presently. "He appeared perfectly ~" asked Croyden. ~" picking up a pearl stud from under the ~" said Macloud. ~" she asked. ~' until the women are safely returned. They *** "paragraphs" starting with a lowercase letter ~s. w. c. ~there were a dozen white men, with slouch hats ~mean it isn't there? ~" he exclaimed. ~the noise of the team. ~whom have we here? ~" as a buggy emerged from ~has drawn her robes about her----" ~a very sweet girl, needs no proof--unless----" ~looking at her with a meaning smile. ~enigmatically. "I want you----" She put one ~slender foot on the fender, and gazed at it, meditatively, ~shall we go this very evening?" ~be correct. So, why? Why?----" She held up ~her hand. "Don't answer! I'm not asking for ~head----" ~years, and I have never before known him to exhibit ~now----" He walked across to the window. He ~would let that sink in.--" How's the Symphony in ~ended. ~her?" ~an expressive gesture, he resumed the ascent. ~is Elaine's," said he. "I recognize the monogram ~than ten thousand cents. I am only----" She ~stopped, staring. ~think----" ~tell you the entire story.............Is ~there anything I have missed? ~" he ended. ~at him the while. ~to my affection for Elaine, it's vanished, now.---- ~his coat......" Oh! I forgot to say, I ~wired the Pinkerton man to recover the package ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080725/30c56f5c/attachment.htm From prosfilaes at gmail.com Fri Jul 25 16:56:50 2008 From: prosfilaes at gmail.com (David Starner) Date: Fri, 25 Jul 2008 19:56:50 -0400 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: <6d99d1fd0807251656y3ae300f8r73a5df0adb9f05de@mail.gmail.com> On Thu, Jul 24, 2008 at 11:31 PM, John Vandenberg wrote: > It is readers > who find the most difficult transcription errors, simply because there > are more of them, and they are the ones that are shocked. A > proofreader is looking for problems, and can easily miss them for > lookn at them. A reader isn't expecting problems, and so is rudely > awakened from their enjoyable reading when they see an error. You state that as fact; any statistics? I think there's a lot of errors that a proofreader will almost invariably catch that readers won't. For example: > There was mild laughter, but Foster went into paroxysms. He slapped his thighs and shook his > head. He always laughed heartily at humor at his own expense. That established the point that he > could "take it," and give him license to "dish it out." What's the error in that paragraph? I don't think any reader would catch it. Any proofreader, with even the vaguest attention to the scan, would see that missing sentence in a heartbeat. Readers miss subtle substitutions (did you catch the fact I changed jokes to humor in the above text?) and tend to hypercorrect things to match their own perception of right; how many changes at Wikisource are to correct misspellings like humour and colour? Over the long run, there are sets of errors that an unlimited supply of readers will eventually catch; but I think that there are many errors that will never be caught except by comparing with scans or another etext, and that frequently for minor texts the number of readers a text accumulates will fail to do as good a job as the concerted series of proofreaders DP does. (In some cases, I think DP has more proofreaders than will ever read the book as an etext; at least it means that all of the books DP puts into PG pass basic quality standards.) From jayvdb at gmail.com Fri Jul 25 20:20:34 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Sat, 26 Jul 2008 13:20:34 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: On Sat, Jul 26, 2008 at 3:56 AM, wrote: > john- > > please ignore the ralf-baiting. the d.p. echo-chamber > has convinced itself that its product is quite superior... > > *** > > i said: >> if anything's up in the interim, i'll let you know that too. > > it did just so happen that something jumped out right away. > > you're rewrapping the text before it goes in front of people > -- removing the original linebreaks from the print book -- > which makes the text _exceedingly_ more difficult to proof... > > if you wanna know the one thing you're doing wrong, this is it. Interesting point. We do re-wrap the "view" display. Line breaks in the actual data are being dropped in the output, so that a paragraph is reading for reading. This is an example of how our imperative to "publish" (no wiki page is unpublished; it's live immediately, for good or ill) has meant that we have prioritised the final published view over the proofreading interface. I think we can do better, with a minor tweak to the display engine. I've made the proposal here: http://en.wikisource.org/wiki/WS:S#Adding_purge_link_to_Index_pages In the actual data, the line breaks are not being intentionally dropped. When we populate a transcription project from online transcriptions, the original line breaks are usually long gone. If the proofreading problem discussed above is fixed, then it would make sense that we would restore the line breaks before we finish a project. When we load OCR into system, the line breaks are retained in the editing window. If the browser window is too small, the web browser wraps the long lines: http://en.wikisource.org/w/index.php?title=Page:United_States_Statutes_at_Large_Volume_1_-_p1-22.djvu/19&action=edit > i would estimate it cuts proofing efficiency in half, or _more_. > moreover, it makes it much less pleasurable to do. lose-lose. > > for a look at a system that retains the p-book linebreaks, see: >> http://z-m-l.com/go/mabie/mabiep123.html >> http://z-m-l.com/go/myant/myantp123.html >> http://z-m-l.com/go/sgfhb/sgfhbp123.html These look good, and the zml format looks easy to import into Wikisource, except that the images are separate rather than bundled into a djvu. I'd like to write a zml importer. Is there any specific work you would be interested in seeing on Wikisource. Perhaps one you have already pre-processed, but is not yet proofread? Thanks for taking a look. -- John From Bowerbird at aol.com Fri Jul 25 21:21:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 00:21:42 EDT Subject: [gutvol-d] getting my wikisource bearings Message-ID: john said: > In the actual data, the line breaks are not being intentionally dropped. i take it that you mean they are intentionally retained?... :+) is there a way that a person can get to this "actual data"? > When we populate a transcription project from online transcriptions, > the original line breaks are usually long gone. usually. but sometimes if you "view source" on the .html, you will find that they are still there, in the .html source... > If the proofreading problem discussed above is fixed, then > it would make sense that we would restore the line breaks > before we finish a project. i've tried that -- it's a lot of work... so much work that it's faster and simpler to re-do the o.c.r. and apply the corrections to that. that's why it's so darn troublesome when a digitizer rewraps text. (although now that i've gone and said that, i suppose that i could write a tool that would simplify that specific task. i'd have to try.) > When we load OCR into system, the line breaks are > retained in the editing window.? If the browser window > is too small, the web browser wraps the long lines: but, if i've understood you correctly, doing proofreading in the editing window has problems too, including obtrusive markup. *** we should tie this into some ramifications, too, to be complete. i believe that it's important to tie an e-text to its "ur" paper-book. but it's not enough to simply _do_ that, you have to also make it _obvious_ and _easily_verifiable_ -- by anyone who cares to look. retaining the linebreaks is the one thing you can do to make it easy. it's just too difficult to verify that the text is the same as the scan when the linebreaks in the text have been removed. so, given two sources of an e-text, the future will gravitate toward the one that kept the original linebreaks, as more-easily verified... there are other ramifications as well -- verisimilitude in printing, for instance -- but easy verifiability is _the_ most important one. (even more important than easier proofreading, in the long run, but since they both point to keeping the linebreaks, no conflict.) *** > If the browser window is too small, the web browser wraps the long lines: i've got a cinema-screen, so i can make the browser window "big". but if i couldn't, i'd probably start resenting that column on the left. i'd want to get rid of as much chrome as possible. just text and scan. > http://en.wikisource.org/w/index.php?title=Page:United_States_Statutes_at_Large_Volume_1_-_p1-22.djvu/19&action=edit uueeww. i'm quite sure i didn't want to look at _that_ page of o.c.r. i'll just pretend i didn't see it for now... ;+) > I'd like to write a zml importer. if you want to. but i can probably write it more easily than you can. i've already written zml-to-html and zml-to-pdf conversion tools... but what i'd _most_ like to do is streamline wiki-markup to be more zen. > Is there any specific work you would be interested > in seeing on Wikisource. Perhaps one you have > already pre-processed, but is not yet proofread? none right now, no. but let me think a little bit on that. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/d37733ce/attachment-0001.htm From dakretz at gmail.com Fri Jul 25 22:03:59 2008 From: dakretz at gmail.com (don kretz) Date: Fri, 25 Jul 2008 22:03:59 -0700 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 34 In-Reply-To: References: Message-ID: <627d59b80807252203k2be68603sefab0fa777c6cc97@mail.gmail.com> John, Let me attire myself with my technical naivete cap, and ask a question with a probably obvious answer. What's with this djvu? >From googling around, it appears to be a pdf alternative. But is seems to be strongly preferred in wikiland. Why? And if it's equivalent, why isn't it supported interchangeably with pdf files, which we all know how to build and deconstruct? On Fri, Jul 25, 2008 at 9:21 PM, wrote: > Send gutvol-d mailing list submissions to > gutvol-d at lists.pglaf.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.pglaf.org/listinfo.cgi/gutvol-d > or, via email, send a message with subject or body 'help' to > gutvol-d-request at lists.pglaf.org > > You can reach the person managing the list at > gutvol-d-owner at lists.pglaf.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of gutvol-d digest..." > > Today's Topics: > > 1. Re: getting my wikisource bearings (Andrew Sly) > 2. how to clean up ("preprocess") the o.c.r. for a book -- 024 > (Bowerbird at aol.com) > 3. woman in her own right -- 003 (Bowerbird at aol.com) > 4. Re: getting my wikisource bearings (David Starner) > 5. Re: getting my wikisource bearings (John Vandenberg) > 6. Re: getting my wikisource bearings (Bowerbird at aol.com) > > > ---------- Forwarded message ---------- > From: Andrew Sly > To: Project Gutenberg Volunteer Discussion > Date: Fri, 25 Jul 2008 12:12:44 -0700 (PDT) > Subject: Re: [gutvol-d] getting my wikisource bearings > > _Pace_ Ralph and John. > > > As this is a project gutenberg mailing list, it might be > good to remember that Greb Newby and Michael Hart have > affirmed many times that PG wishes to encourage anyone > and everyone interested in the goals of digitizing > and preserving texts, regardless of methods used. > (Kind of "the more the merrier" approach.) > > > PGDP and wikisource have come from different backgrounds, > and I'm sure there is room for improvement in both. > (There always is.) I'm also sure that participants in > each could learn something from exploring the other. > > Right away a difference that I see is that wikisource > appears to deal more with shorter texts, ie, single > poems, historically important letters, patents, > codes of law, etc. > > Andrew > > > > ---------- Forwarded message ---------- > From: Bowerbird at aol.com > To: gutvol-d at lists.pglaf.org, Bowerbird at aol.com > Date: Fri, 25 Jul 2008 18:41:03 EDT > Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- > 024 > 24. search for all lines with a comma followed by non-whitespace, > except the following cases, which are all accepted to be allowed: > > comma-doublequote-whitespace > > comma-singlequote-whitespace > > comma-emdash > > comma-doublequote-emdash > > comma-singlequote-emdash > > this routine returned these 7 cases: > > Gordon's lips formed a silent exclamation.,. > > They glanced,-each at the other, swiftly; it > > girl until--until Buckley.,. until to-night, now. > > darkly, Gordon, stood still, Meta Beggs fe.ll be,- > > It enraged him that she was so collected; her body,* > > to its goal,., Gordon saw now that Mrs. Caley > > your wife. Miss Beggs oughtn't.,. she isn't anything > > 4 of them (#1, #3, #6, and #7) were misrecognized ellipses, > #2 and #4 were specks that were misrecognized as a dash, > and #5 had a speck that was misrecognized as an asterisk... > > 7 more lines corrected, for a grand total of 208, on 24 routines... > > i'll be back tomorrow with the next tip in this series... > > -bowerbird > > > > ************** > Get fantasy football with free live scoring. Sign up for FanHouse Fantasy > Football today. > (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > > ---------- Forwarded message ---------- > From: Bowerbird at aol.com > To: gutvol-d at lists.pglaf.org, Bowerbird at aol.com > Date: Fri, 25 Jul 2008 19:03:48 EDT > Subject: [gutvol-d] woman in her own right -- 003 > continuing on with the "woman in her own right" book... > > today we observe that roger frank has learned from us. > didn't bother to say "thanks", but we don't really need it. > > appended we show 44 lines that roger's program flagged, > and when we look at _why_ they were flagged, we see that > roger has incorporated some of our checks into his app... > > for instance, he is now flagging lines that start with a space. > > he's also now flagging lines that start with a period, and > also flagged are lines that start with a spacey quotemark... > > and finally, he's now flagging "paragraphs" that start with > a lowercase letter, which are typically broken paragraphs. > > none of these types of glitches were flagged in "mountain blood", > so that means roger has added the checks in recently, obviously > as a result of my series. so i'm glad that he was paying attention. > > evidently, he was able to pick up on the hints i offered via _posts_, > even though he went on to "accuse" me of _not_sharing_. funny... > > *** > > it's still the wrong approach, though, to _flag_ stuff for proofers. > these easily-located errors should be fixed by a preprocessor, > on a whole-book-wide basis, as a regular part of the workflow, > using a dedicated tool that will help streamline the procedure... > > inserting characters _in_ the text -- in the form of tilda flags -- > which must then be _removed_ by the proofers is kind of stupid. > both the insertion and the removal take _energy_ to accomplish, > so it's just a make-work scenario. we want to _minimize_ work... > > it's also the case that the "flag-removal" process _causes_errors_. > > i forget which book i was looking at -- perhaps "the crevice" -- > and the "incorrect corrections" on tilda-flagged spacey-quotes > was astronomical. that is, instead of making it a close-quote > (which is what it actually was) by attaching the quote to the word > before it, the proofers made it an open-quote on the word after. > (or vice versa, where they made an open-quote a close-quote.) > > now, proofers almost _never_ get spacey-quotes wrong like that, > not when the spacey-quotes aren't flagged. but these tilda-flags > having to be removed made the proofers goof up the job big-time. > > i've never seen more "incorrect corrections" done on spacey-quotes. > indeed, i've never seen so many "incorrect corrections" on anything! > > i'll be able to present you with some hard numbers on this later, > when i cover whatever book it was that manifested this problem; > for now, just store a little nugget about a problem with flagging. > > appended are the 44 lines which _could_ have been preprocessed, > bringing our grand total of found-but-not-fixed errors to _132_... > > instead now the proofers will have to make those corrections... > > -bowerbird > > > *** lines that start with a space > ~ The remaining member of the party was Montecute > ~ Many Prominent Persons Among the Creditors. > > > *** lines that start with a period > ~.to his lips--and, then, without a word, swung > ~.go by water to Baltimore (which was available on > ~.PIRATE'S GOLD 151 > ~. . . Thank you! Now, you may arise and shake > ~.said Croyden. > ~.Kneeling, he quickly dug with a small trowel a hole > > > *** lines that start with a quotemark followed by a space > ~" she asked presently. "He appeared perfectly > ~" asked Croyden. > ~" picking up a pearl stud from under the > ~" said Macloud. > ~" she asked. > ~' until the women are safely returned. They > > > *** "paragraphs" starting with a lowercase letter > ~s. w. c. > ~there were a dozen white men, with slouch hats > ~mean it isn't there? ~" he exclaimed. > ~the noise of the team. > ~whom have we here? ~" as a buggy emerged from > ~has drawn her robes about her----" > ~a very sweet girl, needs no proof--unless----" > ~looking at her with a meaning smile. > ~enigmatically. "I want you----" She put one > ~slender foot on the fender, and gazed at it, meditatively, > ~shall we go this very evening?" > ~be correct. So, why? Why?----" She held up > ~her hand. "Don't answer! I'm not asking for > ~head----" > ~years, and I have never before known him to exhibit > ~now----" He walked across to the window. He > ~would let that sink in.--" How's the Symphony in > ~ended. > ~her?" > ~an expressive gesture, he resumed the ascent. > ~is Elaine's," said he. "I recognize the monogram > ~than ten thousand cents. I am only----" She > ~stopped, staring. > ~think----" > ~tell you the entire story.............Is > ~there anything I have missed? ~" he ended. > ~at him the while. > ~to my affection for Elaine, it's vanished, now.---- > ~his coat......" Oh! I forgot to say, I > ~wired the Pinkerton man to recover the package > > > > ************** > Get fantasy football with free live scoring. Sign up for FanHouse Fantasy > Football today. > (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > > ---------- Forwarded message ---------- > From: "David Starner" > To: "Project Gutenberg Volunteer Discussion" > Date: Fri, 25 Jul 2008 19:56:50 -0400 > Subject: Re: [gutvol-d] getting my wikisource bearings > On Thu, Jul 24, 2008 at 11:31 PM, John Vandenberg > wrote: > > It is readers > > who find the most difficult transcription errors, simply because there > > are more of them, and they are the ones that are shocked. A > > proofreader is looking for problems, and can easily miss them for > > lookn at them. A reader isn't expecting problems, and so is rudely > > awakened from their enjoyable reading when they see an error. > > You state that as fact; any statistics? I think there's a lot of > errors that a proofreader will almost invariably catch that readers > won't. For example: > > > There was mild laughter, but Foster went into paroxysms. He slapped his > thighs and shook his > > head. He always laughed heartily at humor at his own expense. That > established the point that he > > could "take it," and give him license to "dish it out." > > What's the error in that paragraph? I don't think any reader would > catch it. Any proofreader, with even the vaguest attention to the > scan, would see that missing sentence in a heartbeat. Readers miss > subtle substitutions (did you catch the fact I changed jokes to humor > in the above text?) and tend to hypercorrect things to match their own > perception of right; how many changes at Wikisource are to correct > misspellings like humour and colour? Over the long run, there are sets > of errors that an unlimited supply of readers will eventually catch; > but I think that there are many errors that will never be caught > except by comparing with scans or another etext, and that frequently > for minor texts the number of readers a text accumulates will fail to > do as good a job as the concerted series of proofreaders DP does. (In > some cases, I think DP has more proofreaders than will ever read the > book as an etext; at least it means that all of the books DP puts into > PG pass basic quality standards.) > > > > ---------- Forwarded message ---------- > From: "John Vandenberg" > To: "Project Gutenberg Volunteer Discussion" > Date: Sat, 26 Jul 2008 13:20:34 +1000 > Subject: Re: [gutvol-d] getting my wikisource bearings > On Sat, Jul 26, 2008 at 3:56 AM, wrote: > > john- > > > > please ignore the ralf-baiting. the d.p. echo-chamber > > has convinced itself that its product is quite superior... > > > > *** > > > > i said: > >> if anything's up in the interim, i'll let you know that too. > > > > it did just so happen that something jumped out right away. > > > > you're rewrapping the text before it goes in front of people > > -- removing the original linebreaks from the print book -- > > which makes the text _exceedingly_ more difficult to proof... > > > > if you wanna know the one thing you're doing wrong, this is it. > > Interesting point. We do re-wrap the "view" display. Line breaks in > the actual data are being dropped in the output, so that a paragraph > is reading for reading. This is an example of how our imperative to > "publish" (no wiki page is unpublished; it's live immediately, for > good or ill) has meant that we have prioritised the final published > view over the proofreading interface. I think we can do better, with > a minor tweak to the display engine. I've made the proposal here: > > http://en.wikisource.org/wiki/WS:S#Adding_purge_link_to_Index_pages > > In the actual data, the line breaks are not being intentionally dropped. > > When we populate a transcription project from online transcriptions, > the original line breaks are usually long gone. If the proofreading > problem discussed above is fixed, then it would make sense that we > would restore the line breaks before we finish a project. > > When we load OCR into system, the line breaks are retained in the > editing window. If the browser window is too small, the web browser > wraps the long lines: > > > http://en.wikisource.org/w/index.php?title=Page:United_States_Statutes_at_Large_Volume_1_-_p1-22.djvu/19&action=edit > > > i would estimate it cuts proofing efficiency in half, or _more_. > > moreover, it makes it much less pleasurable to do. lose-lose. > > > > for a look at a system that retains the p-book linebreaks, see: > >> http://z-m-l.com/go/mabie/mabiep123.html > >> http://z-m-l.com/go/myant/myantp123.html > >> http://z-m-l.com/go/sgfhb/sgfhbp123.html > > These look good, and the zml format looks easy to import into > Wikisource, except that the images are separate rather than bundled > into a djvu. I'd like to write a zml importer. Is there any specific > work you would be interested in seeing on Wikisource. Perhaps one you > have already pre-processed, but is not yet proofread? > > Thanks for taking a look. > > -- > John > > > > ---------- Forwarded message ---------- > From: Bowerbird at aol.com > To: gutvol-d at lists.pglaf.org, Bowerbird at aol.com > Date: Sat, 26 Jul 2008 00:21:42 EDT > Subject: Re: [gutvol-d] getting my wikisource bearings > john said: > > In the actual data, the line breaks are not being intentionally > dropped. > > i take it that you mean they are intentionally retained?... :+) > > is there a way that a person can get to this "actual data"? > > > > When we populate a transcription project from online transcriptions, > > the original line breaks are usually long gone. > > usually. but sometimes if you "view source" on the .html, > you will find that they are still there, in the .html source... > > > > If the proofreading problem discussed above is fixed, then > > it would make sense that we would restore the line breaks > > before we finish a project. > > i've tried that -- it's a lot of work... so much work that it's faster > and simpler to re-do the o.c.r. and apply the corrections to that. > that's why it's so darn troublesome when a digitizer rewraps text. > > (although now that i've gone and said that, i suppose that i could > write a tool that would simplify that specific task. i'd have to try.) > > > > When we load OCR into system, the line breaks are > > retained in the editing window. If the browser window > > is too small, the web browser wraps the long lines: > > but, if i've understood you correctly, doing proofreading in the > editing window has problems too, including obtrusive markup. > > *** > > we should tie this into some ramifications, too, to be complete. > > i believe that it's important to tie an e-text to its "ur" paper-book. > but it's not enough to simply _do_ that, you have to also make it > _obvious_ and _easily_verifiable_ -- by anyone who cares to look. > > retaining the linebreaks is the one thing you can do to make it easy. > > it's just too difficult to verify that the text is the same as the scan > when the linebreaks in the text have been removed. > > so, given two sources of an e-text, the future will gravitate toward > the one that kept the original linebreaks, as more-easily verified... > > there are other ramifications as well -- verisimilitude in printing, > for instance -- but easy verifiability is _the_ most important one. > (even more important than easier proofreading, in the long run, > but since they both point to keeping the linebreaks, no conflict.) > > *** > > > If the browser window is too small, the web browser wraps the long > lines: > > i've got a cinema-screen, so i can make the browser window "big". > but if i couldn't, i'd probably start resenting that column on the left. > i'd want to get rid of as much chrome as possible. just text and scan. > > > > > http://en.wikisource.org/w/index.php?title=Page:United_States_Statutes_at_Large_Volume_1_-_p1-22.djvu/19&action=edit > > uueeww. i'm quite sure i didn't want to look at _that_ page of o.c.r. > i'll just pretend i didn't see it for now... ;+) > > > > I'd like to write a zml importer. > > if you want to. but i can probably write it more easily than you can. > i've already written zml-to-html and zml-to-pdf conversion tools... > > but what i'd _most_ like to do is streamline wiki-markup to be more zen. > > > > Is there any specific work you would be interested > > in seeing on Wikisource. Perhaps one you have > > already pre-processed, but is not yet proofread? > > none right now, no. but let me think a little bit on that. > > -bowerbird > > > > ************** > Get fantasy football with free live scoring. Sign up for FanHouse Fantasy > Football today. > (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080725/e209656e/attachment-0001.htm From dakretz at gmail.com Fri Jul 25 22:12:55 2008 From: dakretz at gmail.com (don kretz) Date: Fri, 25 Jul 2008 22:12:55 -0700 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 34 In-Reply-To: <627d59b80807252203k2be68603sefab0fa777c6cc97@mail.gmail.com> References: <627d59b80807252203k2be68603sefab0fa777c6cc97@mail.gmail.com> Message-ID: <627d59b80807252212i15416ee5j6cb8fa2859a6adb2@mail.gmail.com> Bird, I've got an impending project for EB that's unusually light on the troublesome stuff - tables, math, chemistry... Let me know when you'd like to try something a little more substantial, and take a run at it after I've worked it over with my regexes. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080725/36a85105/attachment.htm From Bowerbird at aol.com Fri Jul 25 22:24:29 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 01:24:29 EDT Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 34 Message-ID: dakretz said: > Let me know when you'd like to try something a little more substantial, > and take a run at it after I've worked it over with my regexes. well... i'm _already_ working on "something a little more substantial" -- and am badly behind on it -- but feel free to point me at anything. although i doubt there will be anything left in it after you're finished. and congrats on getting e.b. out of the rounds! after what... 2 years? but change your darn subject line! :+) and trim those digests! ;+) -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/09bf739b/attachment.htm From jayvdb at gmail.com Fri Jul 25 23:37:04 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Sat, 26 Jul 2008 16:37:04 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: On Sat, Jul 26, 2008 at 2:21 PM, wrote: > john said: >> In the actual data, the line breaks are not being intentionally dropped. > > i take it that you mean they are intentionally retained?... :+) Not really. They are being retained where there is no good reason to drop them. For example, if a word is hyphenated over a line break, the line break is being dropped. > is there a way that a person can get to this "actual data"? Click edit. The bot frameworks always obtain this raw wiki text, rather than the html which is sent to the browser when a page is viewed. >> When we populate a transcription project from online transcriptions, >> the original line breaks are usually long gone. > > usually. but sometimes if you "view source" on the .html, > you will find that they are still there, in the .html source... > >> If the proofreading problem discussed above is fixed, then >> it would make sense that we would restore the line breaks >> before we finish a project. > > i've tried that -- it's a lot of work... so much work that it's faster > and simpler to re-do the o.c.r. and apply the corrections to that. > that's why it's so darn troublesome when a digitizer rewraps text. > > (although now that i've gone and said that, i suppose that i could > write a tool that would simplify that specific task. i'd have to try.) See http://en.wikisource.org/wiki/Index:Nietzsche_the_thinker.djvu This was populated by me copying and pasting the text from another website over a period of 10 hours; it was buried away in a forum somewhere, and had no line breaks. That is what I mean by the line breaks being long gone. I broke it up into pages for proofreading purposes, and we _could_ recommend that it is broken up into lines as a early stage in the process, if that is profitable. I doubt Wikisource would ever demand that people do this, but I guess that depends on the arguments for it. >> When we load OCR into system, the line breaks are >> retained in the editing window. If the browser window >> is too small, the web browser wraps the long lines: > > but, if i've understood you correctly, doing proofreading in the > editing window has problems too, including obtrusive markup. Yup. Hence I raised the problem on the Wikisource discussion board, as the real problem is the viewing interface should be the proofreading interface. > *** > > we should tie this into some ramifications, too, to be complete. > > i believe that it's important to tie an e-text to its "ur" paper-book. > but it's not enough to simply _do_ that, you have to also make it > _obvious_ and _easily_verifiable_ -- by anyone who cares to look. > > retaining the linebreaks is the one thing you can do to make it easy. Agreed. > it's just too difficult to verify that the text is the same as the scan > when the linebreaks in the text have been removed. Agreed. The eye needs visual markers to help it to return to the right spot as it flicks between text and image. > so, given two sources of an e-text, the future will gravitate toward > the one that kept the original linebreaks, as more-easily verified... > > there are other ramifications as well -- verisimilitude in printing, > for instance -- but easy verifiability is _the_ most important one. > (even more important than easier proofreading, in the long run, > but since they both point to keeping the linebreaks, no conflict.) > > *** > >> If the browser window is too small, the web browser wraps the long >> lines: > > i've got a cinema-screen, so i can make the browser window "big". > but if i couldn't, i'd probably start resenting that column on the left. > i'd want to get rid of as much chrome as possible. just text and scan. As you have created a user, you can alter your preferences to pick a different skin. Enjoy. >> http://en.wikisource.org/w/index.php?title=Page:United_States_Statutes_at_Large_Volume_1_-_p1-22.djvu/19&action=edit > > uueeww. i'm quite sure i didn't want to look at _that_ page of o.c.r. > i'll just pretend i didn't see it for now... ;+) But I picked it out especially for you! I'm heart broken. >> I'd like to write a zml importer. > > if you want to. but i can probably write it more easily than you can. > i've already written zml-to-html and zml-to-pdf conversion tools... I'm more interested in seeing the result than in writing the importer. > but what i'd _most_ like to do is streamline wiki-markup to be more zen. The wiki markup is the least malleable piece of the Wikisource system. The wiki syntax is in use in hundreds of gigs of data on the Wikimedia servers, so any improvements need to be backwards compatible, and be thoroughly tested. The devs dont like proposals to change the syntax, and we have recently been through a rewrite of the parser to speed it up and make it conform to an EBNF, so I think they will snarl at anyone who suggests they go back and tweak even the most obvious problem with it. -- John From hyphen at hyphenologist.co.uk Sat Jul 26 01:05:39 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Sat, 26 Jul 2008 09:05:39 +0100 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 34 In-Reply-To: <627d59b80807252212i15416ee5j6cb8fa2859a6adb2@mail.gmail.com> References: <627d59b80807252203k2be68603sefab0fa777c6cc97@mail.gmail.com> <627d59b80807252212i15416ee5j6cb8fa2859a6adb2@mail.gmail.com> Message-ID: <000001c8eef6$653e5b10$2fbb1130$@co.uk> don kretz wrote >Bird, >I've got an impending project for EB that's unusually light on the >troublesome stuff - tables, math, chemistry... Let me know when >you'd like to try something a little more substantial, and take a run >at it after I've worked it over with my regexes. The typesetting of ?standards? of pre 1922 books were non existent. The typesetting even in books by a single publisher vary wildly. In the old days typesetting depended on which of the experienced old men actually did the job, or in my case the job was given the new apprentice, who made a mess of it L. Worse the OCR errors produced depends massively on how much ink the printer put on the plates, which varies massively throughout a single book. Worse "my" books contain both prose and poetry dispersed within prose. Not to mention subtitles in crazy fonts which OCR will never get right. Having played with regex on another job. It is my opinion that any regexs for PG will work on anything other than the book for which they were written. Dave Fawthrop -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/7cbdcca5/attachment.htm From tb at baechler.net Sat Jul 26 02:26:24 2008 From: tb at baechler.net (Tony Baechler) Date: Sat, 26 Jul 2008 02:26:24 -0700 Subject: [gutvol-d] Tor books giveaway Message-ID: <488AEDC0.50807@baechler.net> All, The dozen books or so that Tor is giving away are now available on one page for download. This page apparently expires on July 27th, or 27 July. Get them while you can! There is no plain text, but all are in pdf, html, and html zip. Some are in other formats. The wallpapers are also available for download. http://tor.com/index.php?option=com_content&view=blog&id=577 -- ---------- To reply, change slash to dot and remove example from the address. It's left as an exercise to put the rest of the address in the right order. From jayvdb at gmail.com Sat Jul 26 08:51:19 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Sun, 27 Jul 2008 01:51:19 +1000 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 34 In-Reply-To: <627d59b80807252203k2be68603sefab0fa777c6cc97@mail.gmail.com> References: <627d59b80807252203k2be68603sefab0fa777c6cc97@mail.gmail.com> Message-ID: On Sat, Jul 26, 2008 at 3:03 PM, don kretz wrote: > John, > > Let me attire myself with my technical naivete cap, and ask a question with > a probably obvious answer. > > What's with this djvu? http://en.wikipedia.org/wiki/DjVu Some tips on how to construct them are here: http://en.wikisource.org/wiki/Help:Djvu > From googling around, it appears to be a pdf alternative. But is seems to be > strongly preferred in wikiland. Why? And if it's equivalent, why isn't it > supported interchangeably with pdf files, which we all know how to build and > deconstruct? DjVu is a free file format, and the files are smaller for texts. If you look at books on archive.org, the djvu file is always smaller than the PDF. This may be due to the compression routines being used in their PDFs, as they may be not choosing the more sophisticated routines which are covered by patents. http://www.archive.org/details/harperscampingsc00grinrich PDFs are also complex buggers, and encumbered by patents held by Adobe (royalty free use granted to software complying with PDF standard). We do have an Extension for the proofreading system that allows PDFs to be used, however it hasnt been installed. PDFs can be uploaded to Wikisource, and someone will convert them to DjVu. We have a 20MB upload limit, but we are trying to get that lifted. -- John From Bowerbird at aol.com Sat Jul 26 11:29:28 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 14:29:28 EDT Subject: [gutvol-d] getting my wikisource bearings Message-ID: john said: > They are being retained where there is no good reason to drop them. > For example, if a word is hyphenated over a line break, > the line break is being dropped. ok, well, that's sad. it means no print-out verisimilitude to the p-book. i'm convinced that's gonna be very important down the line, not just for its own sake (which will decrease over time), but for _verification_. already there are so many e-versions of some books floating around that we need a means of sorting 'em out, and this will be a prime one. if you can't be linked to a specific p-book, we'll assume you're bogus. > Click edit. hmm... when i do that for this page: > http://en.wikisource.org/w/index.php?title=Page:Wind_in_the_Willows_%281913%29.djvu/110&action=edit ...the edit-box i get does _not_ have the p-book linebreaks... am i misunderstanding you? > See http://en.wikisource.org/wiki/Index:Nietzsche_the_thinker.djvu > This was populated by me copying and pasting the text > from another website over a period of 10 hours; it was > buried away in a forum somewhere, and had no line breaks.? > That is what I mean by the line breaks being long gone.? right. i know. would've been faster to re-do the o.c.r. (and tell me next time, i can write you a scraping tool.) > I broke it up into pages for proofreading purposes, > and we _could_ recommend that it is broken up into lines > as a early stage in the process, if that is profitable. again, better to re-do the o.c.r., and use that cleaned text in a comparison-merge that makes corrections to the o.c.r. eventually, this is what you'll do with all of the p.g. e-texts -- find the p-book on which they were based, re-do the o.c.r., and then use the proofed p.g. e-text to highlight differences -- this method is rads faster than finding them manually -- so you've got text that is accurate with the p-book linebreaks. then you can toss the p.g. e-text, and regenerate .html/.pdf... > we _could_ recommend that it is broken up into lines > as a early stage in the process, if that is profitable. > I doubt Wikisource would ever demand that people do this, > but I guess that depends on the arguments for it. anyone who has proofed both ways will demand the linebreaks. so eventually you will have no proofers willing to do the other... > the real problem is the viewing interface > should be the proofreading interface. well, i don't know if it'd violate some wikisource philosophy, but it would certainly be _possible_ to have both interfaces available, and let people choose which one they wanted to be in at any time. my stance on this has been that, whenever a book is newly posted, it would be in the proofreading interface only, so end-users _know_ that "this book cannot be considered to be _finished_ at this time", and that "your assistance in reporting errors would be highly valued." after a certain amount of time, or a specific number of read-throughs, the status would change such that people could view it in either mode. and, of course, anyone who had a doubt at any time could switch into proofreader mode to view the scan to determine if the text was correct. > you can alter your preferences to pick a different skin.? Enjoy. ah yes, i need to get my wiki thinking-cap on, i'd not thought of that. (a kind person also pointed out i can find out what links to a wiki-page by clicking the "what links here" button in the toolbox. oh yeah. d'oh.) > But I picked it out especially for you!? I'm heart broken. i never promised you a rose garden... ;+) > I'm more interested in seeing the result than in writing the importer. lazy programmers are the best kind... > The wiki markup is the least malleable piece of the Wikisource system. oh yeah, i know that. just wishing i could float back in time and tell ward that "hey, there is a much better way to do what you're trying to do here." it started out with the best of intentions, but then it grew into something that is almost as complex as the markup that it was intended to replace. and now of course it has hardened into concrete. so i whine in protest... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/47bb5631/attachment-0001.htm From hyphen at hyphenologist.co.uk Sat Jul 26 11:51:43 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Sat, 26 Jul 2008 19:51:43 +0100 Subject: [gutvol-d] Tor books giveaway In-Reply-To: <488AEDC0.50807@baechler.net> References: <488AEDC0.50807@baechler.net> Message-ID: <000601c8ef50$c3e424b0$4bac6e10$@co.uk> Tony Baechler wrote >All, >The dozen books or so that Tor is giving away are now available on one >page for download. This page apparently expires on July 27th, or 27 >July. Get them while you can! There is no plain text, but all are in >pdf, html, and html zip. Some are in other formats. The wallpapers are >also available for download. >http://tor.com/index.php?option=com_content&view=blog&id=577 Thanks for that, got them. As an SF fan I usually like Tor books Dave Fawthrop From Bowerbird at aol.com Sat Jul 26 13:24:05 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 16:24:05 EDT Subject: [gutvol-d] djvu Message-ID: john said: > http://en.wikipedia.org/wiki/DjVu aside from the fact that mac people are third-class citizens in djvu-land... i don't know how to display a specific page from a djvu on a web-page... is there an easy way? i assume that you're doing some back-end tricks to accomplish that? and i assume you can share that? -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/192e2ef2/attachment.htm From Bowerbird at aol.com Sat Jul 26 14:55:22 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 17:55:22 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 025 Message-ID: 25. search for doublequote-whitespace-doublequote as you can see from the 6 hits below, this turns up instances where dialog paragraphs (which _should_ be separate) were run together. > "They get more like rats every year." > "I thought about you, held against your will." > "I thought about you, held against your will." > "Don't tell lies; I went right out of your mind." > "Don't tell lies; I went right out of your mind." > "Not as quick as I went out of yours. I did > be made. Going--" > "Thirty-one hundred," Gordon pronounced > "Well," Gordon responded, "and if I did?" > "I studied over it at first," the other frankly admitted; > over the sere grass. "Scrabble for them in the dirt." > "You c'n throw them away now the railroad's left all of these were fixed by introducing the intervening blank line. 6 more lines corrected, for a grand total of 214, on 25 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/c0c8d618/attachment.htm From Bowerbird at aol.com Sat Jul 26 15:03:21 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 18:03:21 EDT Subject: [gutvol-d] woman in her own right -- 003 Message-ID: continuing on with the "woman in her own right" book... let's do the search i just suggested in my "cleanup" series, namely doublequote-linebreak-doublequote... sure enough, we've got a hit in this book too: > you wish to go?" > "At once!" so our total number of corrections is now 133. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/2932e096/attachment.htm From Bowerbird at aol.com Sat Jul 26 15:06:31 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 18:06:31 EDT Subject: [gutvol-d] woman in her own right -- 004 Message-ID: oops! this is a re-send of a formerly-improperly-numbered post. *** continuing on with the "woman in her own right" book... let's do the search i just suggested in my "cleanup" series, namely doublequote-linebreak-doublequote... sure enough, we've got a hit in this book too: >?? you wish to go?" >?? "At once!" so our total number of corrections is now 133. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/de75f69d/attachment.htm From jayvdb at gmail.com Sat Jul 26 19:03:13 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Sun, 27 Jul 2008 12:03:13 +1000 Subject: [gutvol-d] djvu In-Reply-To: References: Message-ID: On Sun, Jul 27, 2008 at 6:24 AM, wrote: > john said: >> http://en.wikipedia.org/wiki/DjVu > > aside from the fact that mac people are third-class citizens in djvu-land... As far as I can see, Mac OS X is well supported by djvulibre-3, and djvulibre-4 is supposed to be portable - I've not tried it yet. > i don't know how to display a specific page from a djvu on a web-page... > > is there an easy way? the djvulibre package contains the tools to pull out an image from the bundle. > i assume that you're doing some back-end tricks to accomplish that? > and i assume you can share that? Our image host has a naming convention which allows these images to be obtained from the djvu, and to be scaled on the fly. e.g. this is the full image of a document I was working on today. http://en.wikisource.org/wiki/Index:GeorgeTCoker.djvu - the transcription "project" page http://en.wikisource.org/wiki/Image:GeorgeTCoker.djvu - the media description page This is the download URL: http://upload.wikimedia.org/wikipedia/commons/2/22/GeorgeTCoker.djvu Once I know it is in the "2/22" folder, I can then pull an image down by asking for page 10 at 250px http://upload.wikimedia.org/wikipedia/commons/thumb/2/22/GeorgeTCoker.djvu/page10-250px-GeorgeTCoker.djvu.jpg or a png at 500px http://upload.wikimedia.org/wikipedia/commons/thumb/2/22/GeorgeTCoker.djvu/page10-500px-GeorgeTCoker.djvu.png I'm not sure how to ask for a maximum resolution image, but I am guessing it is possible. If you are going to be doing offline work with these bundles of images, and need the maximum resolution, it would be better to install the djvu tools on your machines. -- John From jayvdb at gmail.com Sun Jul 27 07:04:41 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Mon, 28 Jul 2008 00:04:41 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: On Sun, Jul 27, 2008 at 4:29 AM, wrote: > john said: >> They are being retained where there is no good reason to drop them. >> For example, if a word is hyphenated over a line break, >> the line break is being dropped. > > ok, well, that's sad. it means no print-out verisimilitude to the p-book. > > i'm convinced that's gonna be very important down the line, not just > for its own sake (which will decrease over time), but for _verification_. > > already there are so many e-versions of some books floating around > that we need a means of sorting 'em out, and this will be a prime one. > if you can't be linked to a specific p-book, we'll assume you're bogus. > > >> Click edit. > > hmm... > > when i do that for this page: >> >> http://en.wikisource.org/w/index.php?title=Page:Wind_in_the_Willows_%281913%29.djvu/110&action=edit > > ...the edit-box i get does _not_ have the p-book linebreaks... > > am i misunderstanding you? That page was populated from an online source; there were no line breaks. I was almost going to explain why word hyphenation is a bit pointless at the moment for Wikisource, but ... what the heck, I've updated pagescan 110 to split the lines as you would expect them, using two different methods to deal with hyphenation. In the first two cases of hyphenated words, I am using a brand new template "hw" which displays the first and third parameter as joined in the published version, and displays a "-" and a line break in the proofreading view: {{hw|con|- |tinued}} Template:hw points to this, which contains the logic explain above: http://en.wikisource.org/wiki/Template:Hyphenated_word In the second two cases, I am using wiki syntax "" to declare that the hyphen and new line are not desired in the published edition. any- thing Neither mechanism has any affect on the published version of the page: http://en.wikisource.org/wiki/The_Wind_in_the_Willows/Chapter_4 >> See http://en.wikisource.org/wiki/Index:Nietzsche_the_thinker.djvu >> This was populated by me copying and pasting the text >> from another website over a period of 10 hours; it was >> buried away in a forum somewhere, and had no line breaks. >> That is what I mean by the line breaks being long gone. > > right. i know. would've been faster to re-do the o.c.r. > > (and tell me next time, i can write you a scraping tool.) Your response here shocked me. I dont think it would be faster going the OCR route, but I'm going to be paying more attention to this. The 10 hours mentioned above was elapsed, rather than constant hand-breaking work. To give us both a better idea of what I've been doing, I did a little on this volume today: http://en.wikisource.org/wiki/Index:Sacred_Books_of_the_East_42.djvu The first 80 pagescans include an introduction and other front-matter, and there is no online edition, so I uploaded the OCR to be proofread a week or two ago. I then earmarked this as a "repopulate" project. Today, I copied the first few pages of Book 1 from sacred-texts.org, and corrected them. (pagescans 81-). I then googled for the corrected text, to see if there was a better transcription online. I found that ishwar.com had most of the errors found. http://www.ishwar.com/hinduism/holy_atharva_veda/ I cross checked this a few times, and then decided to go with ishwar.com instead of sacred-texts.org. Then for 2 hours I copied the text from ishwar.com to the wikisource pages, resulting in the entire of Book 1 done - text reunited with images. I marked each page as "Proofread" because I am guessing it has been through two sets of eyes prior to mine. I fixed one or two errors as I went. If there are errors the verification phase will have to pick them up. I then documented some post-processing needed to improve the wiki formatting: http://en.wikisource.org/wiki/Index_talk:Sacred_Books_of_the_East_42.djvu Ran a bot over the book 1 pages: python replace.py -file:../sbe42b1.list -regex "\n([IVXL]+), ([0-9]+)\. ([^\n]*)\n" "\n===\1, \2. \3===\n" "\n([0-9]+)\n" "\n\1. \n" "Varuna" "Varu''n''a" "kushtha" "kush''th''a" "vrishas" "v''ri''shas" "gavants" "''g''avants" I also run one of my standard fixes over the book 1 pages to convert "--" to emdash: python replace.py -file:../sbe42b1.list -fix:doubledash You can see both changes here: http://en.wikisource.org/w/index.php?title=Page:Sacred_Books_of_the_East_42.djvu/91&action=history In short, 48 pages of good quality text in ~3 hours. >> I broke it up into pages for proofreading purposes, >> and we _could_ recommend that it is broken up into lines >> as a early stage in the process, if that is profitable. > > again, better to re-do the o.c.r., and use that cleaned text > in a comparison-merge that makes corrections to the o.c.r. Interesting idea. I've been considering using o.c.r. to re-paginate a proofread text. It sounds like you're suggesting the opposite would be more fruitful. > eventually, this is what you'll do with all of the p.g. e-texts -- > find the p-book on which they were based, re-do the o.c.r., > and then use the proofed p.g. e-text to highlight differences > -- this method is rads faster than finding them manually -- > so you've got text that is accurate with the p-book linebreaks. > then you can toss the p.g. e-text, and regenerate .html/.pdf... I'n not sure how often we will be re-populating works using PG etexts, but there are many other transcribed books floating across the internet without pagescans that we need to seriously consider how to make use of the transcription work already done. >> we _could_ recommend that it is broken up into lines >> as a early stage in the process, if that is profitable. >> I doubt Wikisource would ever demand that people do this, >> but I guess that depends on the arguments for it. > > anyone who has proofed both ways will demand the linebreaks. > so eventually you will have no proofers willing to do the other... which means there is no need for rules; on a wiki, common practise cautiously follows best practise. :-) >> the real problem is the viewing interface >> should be the proofreading interface. > > well, i don't know if it'd violate some wikisource philosophy, but > it would certainly be _possible_ to have both interfaces available, > and let people choose which one they wanted to be in at any time. > > my stance on this has been that, whenever a book is newly posted, > it would be in the proofreading interface only, so end-users _know_ > that "this book cannot be considered to be _finished_ at this time", > and that "your assistance in reporting errors would be highly valued." > > after a certain amount of time, or a specific number of read-throughs, > the status would change such that people could view it in either mode. > > and, of course, anyone who had a doubt at any time could switch into > proofreader mode to view the scan to determine if the text was correct. We create a "logical" layout on top of the pagescans, so until someone creates that layer, the work can only be read page-by-page. The pagescan and logical views are linked to each other, so the reader can flick between them: links are provided for logical => pagescan ; the opposite direction requires use of "what links here". http://en.wikisource.org/wiki/Special:WhatLinksHere/Page:Wind_in_the_Willows_%281913%29.djvu/110 -- John From Bowerbird at aol.com Sun Jul 27 11:39:10 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 27 Jul 2008 14:39:10 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 026 Message-ID: 26. search for end-line-hyphenates because d.p. dehyphenates its texts, let's search for hits on lines that end with a single-dash, which should then be automatically dehyphenated with appropriate routines. 40 hits on pagebreaks, requiring 2 edits (one on the last line of the previous page, and one on the first line of the next page). > ghaven lips forever awry in the pronouncing of rally- // [40] > stage driver he was totally without resources, with- // [43] > picking up a pen. "When you bought," he re- // [44] > of vain, sick regrets, his combativeness, his deep- // [48] > elder, vanished brother bullying him; the brief ro- // [50] > of tension, increased. Gordon's only con- // [62] > upland hay to a point within a few miles of his des- // [69] > The doctor greeted him seriously. He had, Gor- / [71] > for the ... Gordon!" she exclaimed more ener- // [76] > throat. The odor of June roses that filled the cor- // [88] > The little affair with Buckley Simmons had cap- // [92] > harsh, lik_{t}e a discordant bell clashing in the soste- // [97] > the old man, through his daughter, ad- // [98] > passing, profound gloom. Then the cloud van- // [103] > stood sharply defined, and enclosed by a fence, flow- // [115] > grew from her palpable liking for him, and was re- // [118] > edge. She was silent, and clung to him with a re- // [123] > elements, to the bitter mountain winters, the ruth- // [129] > hoof-beats of a trotting horse, and he had the feel- / [141] > Ah--" in spite of himself, Valentine Simmons be- // [154] > was more than usually unpropitious; and, discover- // [173] > where, from under a horse blanket, Tol'able pro- // [188] > darkly, Gordon, stood still, Meta Beggs fe.ll be,- // [195] > the astute storekeeper into such a satisfactory, retail- // [213] > slimly rounded, graceful; her hands, like mag- // [226] > something." He leaned across the bed, and, grasp- // [232] > there? He would like to be with her at a sap-boil- // [240] > there. He felt in his pocket the cool, sinuous neck- // [244] > Suddenly it appeared to him in the light of a pos- / [250] > the feminine heart. And they summed up the du- // [257] > cry from within the house was too deep to have pro- // [264] > "I didn't do right," he acknowledged to the trav- / [272] > presented the same order, her white shirt- // [286] > There was a prolonged pause in the bidding, dur- / [289] > before investing such a paramount sum, to com- // [326] > task isolated in the midst of a vast, un- // [333] > beyond; the towering east range bathed in keen sun- // [347] > The horses walked swiftly, almost without guid- / [354] > Stenton stage this phenomenon was highly undesir- // [358] > forward, an uncouth, slipping bulk, under the soar- // [364] 2 of the hits were on broken paragraphs, meaning 3 lines will be edited here: the first line, the incorrect blank line, and the bottom line, which is brought up to the first line. > to the door; it said, "Gone fishing. Back to- // morrow." > ' 'TT'VE got something for you," Gordon said sud- // I denly. *** 86 more lines corrected, for a grand total of 300, on 26 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080727/cc030214/attachment-0001.htm From Bowerbird at aol.com Sun Jul 27 11:47:12 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 27 Jul 2008 14:47:12 EDT Subject: [gutvol-d] djvu Message-ID: john said: > As far as I can see, Mac OS X is well supported by djvulibre-3 yes, in the last year since i last checked, they have brought out some stuff. so one can view a djvu properly inside safari. (but not firefox or camino.) and the offline viewer, which had been flawed (e.g., no facing-pages view), has also been improved, significantly, on a number of different dimensions. > If you are going to be doing offline work with these bundles > of images, and need the maximum resolution, it would > be better to install the djvu tools on your machines. i'll experiment, but i think it'll still work better for my purposes to use the images as separate files. archive.org offers both djvu and the "flippy" images, which is what i've been using up to now. for instance, i still don't know how to pull a single page-image from a djvu for display as part of a webpage, which i need to do. nonetheless, it's great to see the djvu improvements for the mac. thanks for all the information. i'm glad djvu is alive and well... :+) -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080727/5f3fc1a1/attachment.htm From Bowerbird at aol.com Sun Jul 27 11:57:06 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 27 Jul 2008 14:57:06 EDT Subject: [gutvol-d] woman in her own right -- 005 Message-ID: ok, let's check "a woman in her own right" to see if there are any uncorrected end-line hyphenates. oh gee, yes, there are, and quite a few of 'em, too. 128 cases of unfixed end-line hyphenates... > Liabilities of twenty million, assets prob- > "It is good to have you forget yourself occa- > at the Heights, if he had not been Warwick Mat- > of Mattison--let us find something more interest- > possibility of being followed by means of his lug- > usually mean and little, the latter unctuously pre- > "Colonel Duval is dead, however," she added^- > Croyden turned into the walk--the black fol- > "Yass, seh! yass, seh!" the darky answered, in- > hear torectly. An' heah comes Marster Dick, his- > country to you, sir, when compared to Northumber- > at the White Sulphur, where both spent their sum- > regimental guidons--and here his portrait in uni- > man ~" (with a wave of his hand toward the por- > want you to see the furniture, and the family por- > "Have you had any experience with negro ser- > sir," suddenly recollecting himself, "Miss Carring- > He turned, to find Moses in the doorway, wait- > "Dese things not pu'chased. No, seh! Dey's bor- > Duvals have remarked it, in making their endorse- > "I am very glad to meet you, Captain Carring- > to speak on every subject under the sun, Litera- > ture--Bridge--Teaching--Music. Oh, she is in- > "I'm very sorry; I'll try to remember in fu- > Miss Erskine frowned in disapproval and aston- > "Can I come down to-night? Answer to Belle- > "Hum--I see--the aristocracy of birth, not dol- > "So you like it--Hampton, I mean? ~" said Mac- > and pointing to the portraits. "I've got ances- > cars. And there isn't any boat sailing until day- > than the tonnage of the Port of Baltimore, to- > a little ways, now," he added, with the country- > us. Evidently there was none erected here, in Par- > may make me liable to my grantor for an account- > of Midshipmen contains muckers as well as gentle- > "Maybe you left it in Hampton?" said Mac- > "That's my fear," Macloud admitted. "Some- > renounce the opportunity for a half million dol- > "Yes--it has! "he said, after a moment's hesi- > "And is it true that you are seriously em- > Avenue; ~" for a supply of small arms and ammu- > "Is this Senator Rickrose? ~" the Lieutenant in- > Then they took several drinks, and the aide de- > only occasionally, and Greenberry Point seemed un- > North under the needle, ran his eye North-by- > "Then your supposition is that, since Par- > "Mr. Smith, this is Mr. Croyden!" said Hook- > have them all, so I can decide--I want no after- > could be identified. He hoped this was satisfac- > Croyden assured him it was more than satis- > "But we wanted to prove that it couldn't suc- > "To Hampton! "Croyden exclaimed, incredu- > her knees, in the reckless fashion women have now- > "I, naturally, don't ask you to violate any con- > "Not exactly--he is not proclaiming him- > positively pathetic. However, Croyden is not suf- > non-essentials, and does the essentials economic- > speak of your own knowledge, not from his infer- > when contrasted with the brightness of North- > "Yass, seh! her am home, seh, I seed she her- > the world--I repeat it--up to the minute in every- > cobblestones, its drains-in-the-gutters, its how much- > the cost of living, and clog the avenue with auto- > them. And then, when the spectators had de- > "Croyden is my name?" he replied, interro- > entire Point, dragged the waters immediately ad- > "We want an equal divide. We will take Par- > "I was endeavoring to state the matter suc- > quarter of a million and we will forget every- > deceased!" said he, and gave her the let- > isn't the slightest danger of any one being tor- > "I shall believe you, when I see him!" incredu- > "I'd sooner be the present one than all the has- > Presently Croyden came to a large, white en- > "It will put me on ~' easy street,' ~" Croyden ob- > "I don't care to inform them as to my where- > mean that you don't intend to return to North- > "Or of being bound, and gagged, and ill- > Croyden. "I could make a fortune writing fic- > She flung him a look that was delightfully allur- > (observing smiles on Croyden and Miss Carring- > want to see him, either to-night or in the morn- > "And if they are proficient, they go--some- > "After a fashion--we went to Dobbs Ferry to- > The second morning after, when Elaine Caven- > dish's maid brought her breakfast, Miss Carring- > very succinct, very informing, and very satis- > "And it's just as delightful to be able to re- > "----to your going along with me--I'm ex- > known him to have even an affair. He is armor- > doesn't please me, I'm going to talk to Miss Car- > her money,' and shows me scant regard in con- > "It seems so!--even Elaine isn't to be consid- > so humble--you're rather proud of your inter- > "What are you responsible for? ~" asked Mac- > "Nothing! Nothing!--not even for my resolu- > me back, again--and so on, and so on--and so- > "You're in a bad way!" laughed Macloud- > Then he continued with the story he was re- > "Very singular," said the Captain. "Half- > house. They discovered nothing which would ex- > "I do not know--if you will come in, I'll in- > "Something like it? ~" he replied, after a mo- > "I don't know! I'm too angry to know any- > "Hasn't Mr. Croyden told you--or Mr. Mac- > "Then maybe I shouldn't--but I will. Par- > and a glove a short distance from Hamp- > "Hence, a proper choice for our temporary resi- > it supplied the deficiency as best pleased the in- > "It's Parmenter again!" said Croyden, sud- > Parmenter's treasure, but they refuse to be con- > "They have been rather persistent," Macloud re- > solely because of us--to force us to dis- > But why? why? Who are Robert Parmenter's Suc- > "Thank you," he said. "I've been sort of un- > possible handicaps due to ignorance or inad- > to the Parmenter jewels, and all that it con- > as much concerned for success as I am," said Croy- > released. We are going to pay the amount de- > "Going to pay the two hundred thousand dol- > "You sent for me, Miss Cavendish?" he in- > "That remains to be seen, as I have also in- > "You're thinking of paying it? ~" he asked, in- > "I always carry a few blank checks in my hand- > "Thank God! you're not Elaine! "Croyden re- > Carrington, and her love for you," Croyden com- > half smothered. "My hair, dear,--do be care- > "In remembrance of your release, and of Par- so our total number of undone corrections is now 261. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080727/c9ed4253/attachment.htm From schultzk at uni-trier.de Sun Jul 27 23:13:02 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Mon, 28 Jul 2008 08:13:02 +0200 Subject: [gutvol-d] preprocessing definition In-Reply-To: References: Message-ID: <6E3D069A-3FCD-47F6-A7A3-D71BBDBF5F8C@uni-trier.de> Hi BB, I personally do not see the need for any human interaction. It is a matter of specifying what up want done ! Also, on the side of computing every process contains three basic steps: 1) preprocessing (preparing data structures, geting data, conversions, etc) 2) the main task 3) post-processing (returning data, clean-up, etc) These three steps are true of preprocessing. Yes, if you define the task during preprocessing to require human interaction, then you do need human interaction otherwise not. regards Keith. Am 23.07.2008 um 20:09 schrieb Bowerbird at aol.com: > roger said: > > I don't intervene or look at the pages myself. > > my experience is that preprocessing can't do all of what needs to > be done > if the methodology does not involve a human who will look at the > pages... > > -bowerbird > > > > ************** > Get fantasy football with free live scoring. Sign up for FanHouse > Fantasy Football today. > (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080728/03681eb6/attachment.htm From schultzk at uni-trier.de Sun Jul 27 23:25:33 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Mon, 28 Jul 2008 08:25:33 +0200 Subject: [gutvol-d] not a dialog In-Reply-To: <4887AA74.6000601@pobox.com> References: <238672185.93911216840318819.JavaMail.mail@webmail09> <4887AA74.6000601@pobox.com> Message-ID: Hi Roger, I do not quite see the problem. Of course that is probaly because I can program in some 15 languages fluently and thereby pick up a knew one easily. At least I can understand what is going on. So may I suggest that you look at the PHP code and 1) jot down what the code is doing. 2) do a little restructuring to afford ruby or perl 3) write the ruby or perl code Voila. You can now intergrate. Yes, some programmers have awful style(not meaning necessarily the ones involved, I have not seen the code). Hope this helps Keith. Am 24.07.2008 um 00:02 schrieb Roger Frank: > Joshua Hutchinson wrote: > > | Just wanted to pop in to ask if you (or anyone else) has > | looked into incorporating these checks into the proofing > | interface at DP? > > That would be a big boost to productivity. The difficulty > for me is that I'm comfortable with Ruby and Perl but > uncomfortable with PHP, and I think that's an important > deficiency for anyone wanting to integrate it at DP. > That's why for me it's a standalone utility, like guiprep, > only written in Ruby--it's just my limitation in being able > to put it inside a wrapper with something stronger than a > textbox widget. If I could find the equivalent of guiguts' > built in editor/presentation manager, only written in Ruby, > I would certainly use it. That would at least make it > interactive in a "proofing round 0" sense. > > So bottom line, for me the answer is that it's only a > "I wish I was smart enough to do that" kind of thing. As > a proofer myself at DP, I agree it would be a big win. > > --Roger Frank > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From schultzk at uni-trier.de Sun Jul 27 23:38:22 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Mon, 28 Jul 2008 08:38:22 +0200 Subject: [gutvol-d] not a dialog In-Reply-To: References: Message-ID: Hi All, Well, I actually do not know. Though I would say a lack of APIs and lack of modularity. I will go as far and say missing documentation. regards Keith. Am 24.07.2008 um 01:03 schrieb Bowerbird at aol.com: > dkretz said: > > You may remember that I implemented a new proofing interface > > a year or two ago, which provided a "preview" mode showing > > real italics, etc. That has since added a quote-matching display, > > and a punctuation reasonability-checker. They may still be on > > the dev server - I haven't checked for a long time. > > juliet said: > > We don't have it yet simply because > > none of our volunteer developers > > has been willing to tackle it. > > if somebody can sort this all out, do please explain it to me, ok? > > thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080728/d05b4a81/attachment.htm From Bowerbird at aol.com Mon Jul 28 14:20:28 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 28 Jul 2008 17:20:28 EDT Subject: [gutvol-d] getting my wikisource bearings Message-ID: john said: > I was almost going to explain why word hyphenation > is a bit pointless at the moment for Wikisource, but ... actually, i'd like to hear that explanation, so i'm well-rounded. > what the heck, I've updated pagescan 110 to split the lines > as you would expect them, using two different methods > to deal with hyphenation. the markup is rather obtrusive, and i think it would probably interfere with proofing. so i'm not sure it's an improvement. perhaps i need to step back and consider the context... > Your response here shocked me. i hope it was a tongue-on-a-9-volt-battery type of shock, not a "clear!"-and-slap-the-pads-against-his-chest shock. :+) > I dont think it would be faster going the OCR route, > but I'm going to be paying more attention to this. it's pretty fast to do o.c.r. you drag the files into abbyy, to make a "batch", and then you turn the program loose. > I cross checked this a few times, and then decided to go with > ishwar.com instead of sacred-texts.org. a merge of the two versions based on comparing them probably would have given you the most accurate text. > Interesting idea.? I've been considering > using o.c.r. to re-paginate a proofread text.? > It sounds like you're suggesting the opposite > would be more fruitful. well, whether you use the already-proofed text to bring the o.c.r. version up to final-quality, or (vice-versa-like) use the o.c.r. version to bring the already-proofed text to final-stage, the effect is the same either way. you're comparing the two and implementing whatever changes are necessary to finalize. > I'n not sure how often we will be re-populating works > using PG etexts, but there are many other transcribed books > floating across the internet without pagescans that we need to > seriously consider how to make use of the transcription work > already done. that's a good cause. doing the comparisons that i've mentioned is -- to my mind -- the best way to leverage the value of that work... so more on that later... > which means there is no need for rules; on a wiki, > common practise cautiously follows best practise. :-) "best practice" is good enough for me, i don't need rules. :+) (of course, it always helps if you spell "best practice" in the _correct_ way, instead of the _british_ way...) ;+) *** ...more thoughts later this week... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080728/f0c9dfd5/attachment.htm From Bowerbird at aol.com Mon Jul 28 14:32:04 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 28 Jul 2008 17:32:04 EDT Subject: [gutvol-d] woman in her own right -- 006 Message-ID: let's check "in her own right" to see if there are any floating question-marks. yep, 8 of 'em... > languid: ~" Been away, somewhere, haven't you ? > if Gaspard, his particular waiter, missed him ? > the Duvals didn't keep an eye on Greenberry Point ? > "You are determined ?--Very well, then, come > "But you're not quite sure ?--oh! modest man!" > moment, will you ?--you're hipped on it!" > "Than your Southern ancestors ?--isn't that > will be: ~' Come over and see us, won't you ?'" so our total number of undone corrections is now 269. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080728/473b358d/attachment.htm From Bowerbird at aol.com Mon Jul 28 15:28:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 28 Jul 2008 18:28:03 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 027 Message-ID: now we switch gears, ever so slightly, and do some housecleaning... our task is to get the _names_ in the book. first we will _check_ them; later, we will use the names as a _control_ during some further checks. (for instance, names will be an allowable exception when we check for words that are inappropriately capitalized in the middle of a sentence.) 27. get the names, and make corrections as needed... i've written several different routines for pulling names out of a text, but perhaps the simplest one is to pull out all capitalized words and then cull out the ones which are present in my standard dictionary... that routine gives the list i've appended. the top group -- with consecutive caps -- will be checked closely. and indeed 10 of the 13 were incorrect, and were fixed... because words that are in the dictionary are rejected from this list, the last name of "berry" was deleted, but that had no consequence. the first name of "lattice" -- where "lettice" had been misrecognized -- was also deleted from this list, so the 10 instances of that error would have become invisible, which is the most striking problem in this book, from the standpoint of the preprocessing. as for other errors that _would_ have been revealed, they were: > Al (was a misrecognition of "a1"), with 1 occurrence... > Erne (was a misrecognition of "effie"), with 2 occurrences... > Inan (was a misrecognition of "in an"), with 1 occurrence... > Itwas (was a misrecognition of "it was"), with 1 occurrence... > Kenny (was a misrecognition of "henny"), with 1 occurrence... > Malummon (misrecognition of "makimmon"), with 1 occurrence... > Mm (was a misrecognition of "him"), with 1 occurrence... > Tompey (was a misrecognition of "pompey"), with 1 occurrence... 18 more lines corrected, for a grand total of 318, on 27 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird > ALFRED'A'KNOPF > CONDON > HERGESHEIMER > JBeggs > MTMOTHER > Nickles'11 > OlfAMEEIOA > PENNYS > RUTHERFORD > T > T7NITED > TOPable > TT'VE > Adelaide > Al > Albermarle > Alexander > Arkansas > Barnwell > Bartamon > Beggs > Berrys > Buckley > Caley > Caleys > Chicago > Christ > Christmas > Clare > Condons > Crandall > Cri > Effie > Elias > Entriken > Erne > Eytalian > Fiesole > French > Goddy > Greenstream > Hagan > Hollidew > Hollidews > Inan > Indian > Itwas > Jackson > Jake > Jesuit > Jesus > June > Kenny > Khufu > Lettice > London > Loyola > MacKimmon > Makimmon > Makimmons > Malummon > Matthew > Memphis > Merlier > Meta > Methodist > Mm > Morley > Nickles > Nile > Ottinger > Otty > Paphian > Paris > Pelliter > Persia > Peterman > Pompey > Presbyterian > Saturday > Sim > Simeon > Simmons > Sprucesap > Stenton > Sunday > Tennessee > Themeny > Thursday > Tol'able > Tompey > Universalist > Vibard > Vibards > Wednesday > Wellbogast > Zebener ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080728/b683f40f/attachment.htm From jayvdb at gmail.com Mon Jul 28 18:02:05 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Tue, 29 Jul 2008 11:02:05 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: On Tue, Jul 29, 2008 at 7:20 AM, wrote: > john said: >> I was almost going to explain why word hyphenation >> is a bit pointless at the moment for Wikisource, but ... > > actually, i'd like to hear that explanation, so i'm well-rounded. At present, the proofreading view doesnt _show_ line breaks that exist in the raw text. e.g. http://en.wikisource.org/wiki/Page:Wind_in_the_Willows_(1913).djvu/110 So, without standard line breaks being visible, there is no incentive to break lines. Without incentives or a feedback loop, breaking words isn't likely to happen. >> what the heck, I've updated pagescan 110 to split the lines >> as you would expect them, using two different methods >> to deal with hyphenation. > > the markup is rather obtrusive, and i think it would probably > interfere with proofing. so i'm not sure it's an improvement. > > perhaps i need to step back and consider the context... The important element to grapple with is that we use the same raw text to proofread and publish. If we break a word across two lines for proofreading purposes, we need to join it back together in the published view. The markup does that. It isnt ideal, but it is a start. We _could_ enhance the parser to understand that a trailing '-' means the word is broken and needs to be merged in the published view. That seems like a very simple and incomplete solution, as compound words can also be broken by a hyphen at the end of the line, and that hyphen needs to be retained. We could use the double hyphen (=) where a compound word is broken across a line, in which case a single hyphen should be placed into the published view. There are very few cases where a published work will use a double hyphen at the end of the line, and _not_ mean that the word is a compound view. (The Japanese use of double hyphen is encoded as U+30A0) Another option is the use U+2027 (hyphenation point) at eol to encode that a line merge is required, allowing compound words to be encoded as "-". But then there will be times that a hyphenation point does actually exist at the eol in some works. More thought required; ideas welcome. >> Your response here shocked me. > > i hope it was a tongue-on-a-9-volt-battery type of shock, > not a "clear!"-and-slap-the-pads-against-his-chest shock. :+) Whichever it was, im still kickin. >> I dont think it would be faster going the OCR route, >> but I'm going to be paying more attention to this. > > it's pretty fast to do o.c.r. you drag the files into abbyy, > to make a "batch", and then you turn the program loose. OCR is the easy part. In this case, the archive.org DJVU file has an OCR layer, and I did use that for pagescans 1-80 because there was no existing transcription online for those pages. http://en.wikisource.org/wiki/Index:Sacred_Books_of_the_East_42.djvu >> I cross checked this a few times, and then decided to go with >> ishwar.com instead of sacred-texts.org. > > a merge of the two versions based on comparing them probably > would have given you the most accurate text. The ishwar.com etext is the sacred-texts.org etext with improvements. I've no interest in trying to wrangle both into alignment in order to do a comparison on them. If there were two disparate transcriptions both having significant errors, it might be worth it. >> Interesting idea. I've been considering >> using o.c.r. to re-paginate a proofread text. >> It sounds like you're suggesting the opposite >> would be more fruitful. > > well, whether you use the already-proofed text to bring the > o.c.r. version up to final-quality, or (vice-versa-like) use the > o.c.r. version to bring the already-proofed text to final-stage, > the effect is the same either way. you're comparing the two > and implementing whatever changes are necessary to finalize. Any existing code around to do something like this ? > (of course, it always helps if you spell "best practice" > in the _correct_ way, instead of the _british_ way...) ;+) You mean the "British" way surely ? They have not been demoted from being proper thing yet. -- John From schultzk at uni-trier.de Tue Jul 29 01:16:56 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Tue, 29 Jul 2008 10:16:56 +0200 Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 027 In-Reply-To: References: Message-ID: Hi All, BB, I have mentioned this many times. Have any of you tried thinking about parsing. That is takiong an analytic approach to the texts. The way I see it you are basically using pattern matching. Quite inefficient. Using parsing you pull in the text at the same time you are catching all those errors. An added feature is you have context information. Tools for doing this kind of work would be flex and bison. Any kind of flaging the text can be incorparated into the parser. On a side note, BB, your routine for names only works for english. But you are not interested in German anyway. How do you handle chapter titles and the like. regards Keith. Am 29.07.2008 um 00:28 schrieb Bowerbird at aol.com: > now we switch gears, ever so slightly, and do some housecleaning... > > our task is to get the _names_ in the book. first we will _check_ > them; > later, we will use the names as a _control_ during some further > checks. > (for instance, names will be an allowable exception when we check for > words that are inappropriately capitalized in the middle of a > sentence.) > > 27. get the names, and make corrections as needed... > > i've written several different routines for pulling names out of a > text, > but perhaps the simplest one is to pull out all capitalized words and > then cull out the ones which are present in my standard dictionary... > > that routine gives the list i've appended. > > the top group -- with consecutive caps -- will be checked closely. > and indeed 10 of the 13 were incorrect, and were fixed... > > because words that are in the dictionary are rejected from this list, > the last name of "berry" was deleted, but that had no consequence. > > the first name of "lattice" -- where "lettice" had been > misrecognized -- > was also deleted from this list, so the 10 instances of that error > would > have become invisible, which is the most striking problem in this > book, > from the standpoint of the preprocessing. > > as for other errors that _would_ have been revealed, they were: > > > Al (was a misrecognition of "a1"), with 1 occurrence... > > Erne (was a misrecognition of "effie"), with 2 occurrences... > > Inan (was a misrecognition of "in an"), with 1 occurrence... > > Itwas (was a misrecognition of "it was"), with 1 occurrence... > > Kenny (was a misrecognition of "henny"), with 1 occurrence... > > Malummon (misrecognition of "makimmon"), with 1 occurrence... > > Mm (was a misrecognition of "him"), with 1 occurrence... > > Tompey (was a misrecognition of "pompey"), with 1 occurrence... > > 18 more lines corrected, for a grand total of 318, on 27 routines... > > i'll be back tomorrow with the next suggestion in this series... > > -bowerbird > > > > ALFRED'A'KNOPF > > CONDON > > HERGESHEIMER > > JBeggs > > MTMOTHER > > Nickles'11 > > OlfAMEEIOA > > PENNYS > > RUTHERFORD > > T > > T7NITED > > TOPable > > TT'VE > > > > Adelaide > > Al > > Albermarle > > Alexander > > Arkansas > > Barnwell > > Bartamon > > Beggs > > Berrys > > Buckley > > Caley > > Caleys > > Chicago > > Christ > > Christmas > > Clare > > Condons > > Crandall > > Cri > > Effie > > Elias > > Entriken > > Erne > > Eytalian > > Fiesole > > French > > Goddy > > Greenstream > > Hagan > > Hollidew > > Hollidews > > Inan > > Indian > > Itwas > > Jackson > > Jake > > Jesuit > > Jesus > > June > > Kenny > > Khufu > > Lettice > > London > > Loyola > > MacKimmon > > Makimmon > > Makimmons > > Malummon > > Matthew > > Memphis > > Merlier > > Meta > > Methodist > > Mm > > Morley > > Nickles > > Nile > > Ottinger > > Otty > > Paphian > > Paris > > Pelliter > > Persia > > Peterman > > Pompey > > Presbyterian > > Saturday > > Sim > > Simeon > > Simmons > > Sprucesap > > Stenton > > Sunday > > Tennessee > > Themeny > > Thursday > > Tol'able > > Tompey > > Universalist > > Vibard > > Vibards > > Wednesday > > Wellbogast > > Zebener > > > > ************** > Get fantasy football with free live scoring. Sign up for FanHouse > Fantasy Football today. > (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080729/bfa20553/attachment-0001.htm From hart at pglaf.org Tue Jul 29 08:50:07 2008 From: hart at pglaf.org (Michael Hart) Date: Tue, 29 Jul 2008 08:50:07 -0700 (PDT) Subject: [gutvol-d] !@!ACTA trade agreement brief for July 29-31 Washington Message-ID: Feedback Wanted!!! WIKILEAKS URGENT DOCUMENT RELEASE Tue Jul 29 10:53:25 BST 2008 ACTA trade agreement industry negotiating brief on Border Measures and Civil Enforcement The ACTA negotiations are scheduled for 29 to 31 July 2008 in Washington DC. In 2007 a select handful of the wealthiest countries began a treaty-making process to create a new global standard for copyright, trademark and patent enforcement, which was called, in a piece of brilliant marketing, the "Anti-Counterfeiting Trade Agreement". ACTA is spearheaded by the United States, and includes the European Commission, Japan, and Switzerland -- which have large copyright and patent industries. Other countries invited to participate in ACTA's negotiation process are Canada, Australia, Korea, Mexico and New Zealand. Noticeably absent from ACTA's negotiations are leaders from developing countries who hold national policy priorities that differ from the international copyright and patent industry. This document is the ACTA negotiating brief dated July 29, 2008, provided by the copyright/patent/trademark industry to negotiating countries; pages concerning customs enforcement and civil enforcement. Under customs enforcement for example it proposes: * Increased inspection of goods to detect potential shipments * Customs to provide rights holders all relevant information for the purposes of their own private investigations and court action they are to be given a minimum of 20 working days to commence such actions. * Seized counterfeit goods are to be destroyed or disposed at the rights holders pleasure. Removing a trademark will not cut it. * Under civil enforcement rights holders will have more say on the damages involved as well as more compensation to cover their legal enforcement costs including "reasonable attorney's fees";. * Rights holders to get the right to obtain information regarding an infringer, their identities, means of production or distribution and relevant third parties. The exact composition of the business "side" is not known, which reflects the lack of transparency afflicting the ACTA process. Whether trade representatives can be forced to reveal the make-up to the press or policy groups remains to be seen. See http://wikileaks.org/wiki/S4 From Bowerbird at aol.com Tue Jul 29 09:06:22 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 29 Jul 2008 12:06:22 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 027 Message-ID: keith said: > Have any of you tried thinking about parsing. i never did like that word -- "parsing". whenever i say that what i am doing is "parsing", everything seems to fall apart. but as long as i don't _call_ it that, it all works quite nicely... > The way I see it you are basically using pattern matching. i don't know the distinctions in your terminology. > Using parsing you pull in the text at the same time > you are catching all those errors. my tool _does_ "pull in the text" in order to "catch the errors". i don't know how it would do it otherwise. > An added feature is you have context information. well, i have two reactions to that. one, when using my tool, you have the entire book as "context", with an entire page of text on-screen in front of you at all times, and the rest of the book available to you, and to "find" operations. two, as i have tried to have people infer from all of my examples, most of the time the _line_ is all the context that you really need. indeed, that's one true beauty when using a line-based approach. that's why 90% of what my tool does is based at the level of the line. > Tools for doing this kind of work would be flex and bison. > Any kind of flaging the text can be incorparated into the parser. i'm not familiar with those, but if you can repurpose them for use in this type of analysis, the so-called open-source proponents here would probably throw flowers your way... > On a side note, BB, your routine for names only works for english. > But you are not interested in German anyway. well, because german capitalizes all nouns (or something like that), you have more words that go into the hopper in the first place, yes, but since most of those nouns will be in the dictionary, they will be eliminated from the name-list, so that's no reason it wouldn't work. but you're right, i'm not really interested in german. not yet, anyway. some of these routines will have to be customized for each language, but i'll leave that task to people who actually _know_ each language. my motivation here is simply to show people how the task is done... > How do you handle chapter titles and the like. what's to handle? roman numerals are in my dictionary, thus recognized as not-names. any word in a chapter header and not in the dictionary goes on the list of possible names, so it's really no different than any other structure... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080729/e0337ea3/attachment.htm From Bowerbird at aol.com Tue Jul 29 09:07:36 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 29 Jul 2008 12:07:36 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 028 Message-ID: 28. do a search for comma-space-uppercase, controlling for names. this returns 36 hits, 33 of which are correct. the list is appended... 3 lines were fixed, where a period was misrecognized as a comma: > away, leaving her pale. Her lips trembled, A palpable, > night, Her eyes made liquid gleams in the wavering > the hand; and, leaning forward, touched it, A of those hits that were correct, some were _titles_, and some were names that double as valid words and thus were in the dictionary (like "buck" and "rose" and "valentine"), and lastly some were names that -- for one reason or another -- our name routines had missed, so we will add them now to our list of names. 3 more lines corrected, for a grand total of 321, on 28 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird > *** 3 lines were fixed (period was misrecognized as a comma): > away, leaving her pale. Her lips trembled, A palpable, > night, Her eyes made liquid gleams in the wavering > the hand; and, leaning forward, touched it, A > *** titles > *** dr. > He passed the Presbyterian Church, Dr. Pelliter's > *** friday > that night, and return the following day, Friday. > *** god > the whole, God forsaken place was worth a thousand," > *** general > regarding Gordon. "Here, here, General Jackson." > thank you for a panful of supper. Come on, General, > "Here, General, here," Gordon commanded, and > and playing him out. Come here, General JacK-son." > oath, but, before he could reach the ground, General > Jackson. C'm here, General." > "C'm here, General," Gordon called, suddenly > *** ginral > "C'm on in, doggy," he called; "c'm in, Ginral. > *** miss > rare in Greenstream. "Why, no, Miss Beggs," he > *** mr. > on the poles. Go in, Mr. Makimmon." > *** mrs. > garden patch beyond, Mrs. Caley said. Gordon > come on in the kitchen. No, Mrs. Caley won't > toward the peacefully grazing horse, Mrs. Caley sitting > before, Mrs. Caley left the room as he entered; and > *** names to be added to the list > *** alec > "And you go right around, Alec," his wife added, > *** augustus > say, Augustus," he demanded in eager, tremulous > *** buck > "Kick him again, Buck," he said; "kick him > "You oughtn't to have done that, Buck," Gordon > *** cannon > got this and that. Then, suddenly, Cannon wanted > -we'll say, Cannon does, with a note in my hand > *** gordon > The doctor greeted him seriously. He had, Gor- > *** mcginty > the throes of a new piece, Mc*Ginty, and Gordon > *** rose > he was intent upon some papers, Rose's husband > ain't, Rose." > *** sampson > "Chalk them up, Sampson," Gordon carelessly > *** tol'able > "Shut up, Tol'able," Buckley Simmons interposed, > where, from under a horse blanket, Tol'able pro- > *** valentine > Ah--" in spite of himself, Valentine Simmons be- > with the pink fox, Valentine Simmons. He thought > "By God, Valentine!" Gordon exclaimed, "I'll ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080729/64cdf54e/attachment.htm From Bowerbird at aol.com Tue Jul 29 09:08:33 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 29 Jul 2008 12:08:33 EDT Subject: [gutvol-d] woman in her own right -- 007 Message-ID: let's check "in her own right" to see if there are any comma-uppercase glitches. yep, 2 of 'em... > *' Nonsense! I understand--moreover, It will > She looked up at him tantalizingly, Her red so our total number of undone corrections is now 271. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080729/fe723bb6/attachment.htm From Bowerbird at aol.com Tue Jul 29 15:48:23 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 29 Jul 2008 18:48:23 EDT Subject: [gutvol-d] getting my wikisource bearings Message-ID: john said: > The important element to grapple with is that > we use the same raw text to proofread and publish. yes, i understand. that's part of what i called the "philosophical" reasons that might stand in the way of accomplishing this matter. either way, you _could_ -- if you wanted to -- give users the ability to _show_ the original linebreaks _or_ to rewrap. my default is to show the linebreaks as they occurred originally, and then to let the user call for a rewrap if they actually want it... > If we break a word across two lines for proofreading purposes, > we need to join it back together in the published view.? again, not necessarily. you could make it a user option... > The markup does that. > It isnt ideal, but it is a start. agreed. on all counts. > We _could_ enhance the parser to understand that a trailing '-' means > the word is broken and needs to be merged in the published view. some end-line dashes are meant to be maintained in the joined word, while others are not, so a distinction needs to be made to differentiate. my rule goes something like this: > if a line ends with a dash or a tilde, the rightmost character is removed. that way, to preserve a dash, i simply add a tilde to the end of the line, so the rule will remove the tilde. > That seems like a very simple and incomplete solution, > as compound words can also be broken by a hyphen > at the end of the line, and that hyphen needs to be retained. oops. i should have read ahead, and put that last bit here instead. > We could use the double hyphen (=) where a compound word is > broken across a line, in which case a single hyphen should be > placed into the published view.? There are very few cases where > a published work will use a double hyphen at the end of the line, > and _not_ mean that the word is a compound view. not sure i've ever seen such a double-hyphen. i try to live as much of my life as i can in the lower ascii characters... > The ishwar.com etext is the sacred-texts.org etext with improvements. > I've no interest in trying to wrangle both into alignment in order to > do a comparison on them.? If there were two disparate transcriptions > both having significant errors, it might be worth it. ok, i didn't understand that specific situation... > Any existing code around to do something like this ? not any that's being released. :+) but there are some free-of-monetary-cost apps that could be released. they're too raw for that now, but they certainly could get a little polish... anyone interested in being a beta-tester should backchannel me now... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080729/c5101b33/attachment.htm From lee at novomail.net Tue Jul 29 17:18:31 2008 From: lee at novomail.net (Lee Passey) Date: Tue, 29 Jul 2008 18:18:31 -0600 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: <488FB357.1080307@novomail.net> John Vandenberg wrote: [snip] > The important element to grapple with is that we use the same raw > text to proofread and publish. > > If we break a word across two lines for proofreading purposes, we > need to join it back together in the published view. The markup does > that. It isnt ideal, but it is a start. > > We _could_ enhance the parser to understand that a trailing '-' means > the word is broken and needs to be merged in the published view. > > That seems like a very simple and incomplete solution, as compound > words can also be broken by a hyphen at the end of the line, and that > hyphen needs to be retained. We could use the double hyphen (=) > where a compound word is broken across a line, in which case a single > hyphen should be placed into the published view. There are very few > cases where a published work will use a double hyphen at the end of > the line, and _not_ mean that the word is a compound view. (The > Japanese use of double hyphen is encoded as U+30A0) > > Another option is the use U+2027 (hyphenation point) at eol to encode > that a line merge is required, allowing compound words to be encoded > as "-". But then there will be times that a > hyphenation point does actually exist at the eol in some works. > > More thought required; ideas welcome. In cases like this, my default approach is to try and figure out a solution using CSS. It seems to me that you have two problems here: 1. how to preserve line breaks in such a way that they are required in one context (proofreading) and optional in a different context (smooth reading); and 2. how to distinguish betweens hyphens used for hyphenation, and hyphens used for compound words. Assuming XHTML markup, one way to preserve line breaks would to use a CRLF plus the
tag wherever there was a line break in the original. Then, when non-proofing you would include a CSS rule that does not display
eaks. The problem with this approach is that in some instances you may want line breaks even when smooth reading. In this case, you would want to create some sort of classification where you can indicate "this line break is optional, but that line break is mandatory." Thus, you could have soft breaks, and hard breaks; an unclassified
element would be presumed to be a soft break, whereas
would be mandatory, and a CSS rule would be in place to enforce the line break. An alternative would be to use an invented element which has no meaning in XHTML. For example, select to indicate a line break in the original text; when proofing you would use a CSS rule that causes a line break whenever this particular element is encountered. When there is no CSS, the HTML spec says that unknown elements should simply be ignored, so having sprinkled throughout the text will be inconsequential when viewed in a standard HTML User Agent. When it comes to line-ending hyphenation, you are entering a realm fraught with controversy. While many, even on this list, would disagree with me, let's assume for the moment that "hard" hyphens are used for creating compound words, and "soft" hyphens are used for splitting words across lines for typographical purposes. The majority seems to believe that, in XML, the hard hyphen is indicated by the '-' character, whereas soft hyphens are indicated by the ­ entity. The minority, however, is very strident, and firmly convinced that /it/ is the majority. Decide for yourself. Whichever position is "right," we need a method to distinguish between hard (mandatory) hyphens, and soft (optional) hyphens, when they exist at the end of a typographical line (I assume that optional hyphens will never exist /inside/ a line). One solution is to simply use the ­ element whenever appropriate, and reserve '-' for use as a mandatory hyphen. Another is to combine hyphenation with the invented element we saw above. If we make the element a non-empty element, we could encapsulate the hyphen character inside the element; e.g. '-'. Now, when the display attribute for the element is set to "none", not only will there be no line break, the hyphen will disappear as well. Of course, in the source it will be important that there not be any white space surrounding the element, or you may get unanticipated wrapping, and will certainly see a space inside a "word." It is also true that in any User Agent which is...CSS-challenged...the hyphen will appear in all cases. On the other hand, if you want to make the hyphen disappear in all cases in these CSS-challenged UA's you could create a special class of line-breaks which only add a hyphen when the line break is of a certain class, and line breaks are enabled, e.g. '' (I tend to prefer this latter solution). Thus, you would encode your sample text from The Wind in the Willows as:

"Yes, and that's part of the trouble," continued the Rat. "Toad's rich, we all know; but he's not a millionaire. And he's a hopelessly bad driver, and quite regardless of law and order. Killed or ruined—it's got to be one of the two things, sooner or later. Badger! we're his friends—oughtn't we to do something?"

A hard hyphen would simply exist outside of the element, but it would still be important to not allow white space surrounding the element, e.g: This is important when using a CSS-challenged browser. [snip] >>> I dont think it would be faster going the OCR route, but I'm >>> going to be paying more attention to this. Like bowerbird, it has pretty much been my experience that starting over with OCR is faster than trying to re-insert lost data in to existing files, which is why I think that Project Gutenberg "plain vanilla text" files will have limited utility in the future. PG has lost so much, both in terms of markup and provenance, that it's faster to recreate the texts from scratch than it is to try and fix the old stuff. [snip] >>> Interesting idea. I've been considering using o.c.r. to >>> re-paginate a proofread text. It sounds like you're suggesting >>> the opposite would be more fruitful. >> >> well, whether you use the already-proofed text to bring the o.c.r. >> version up to final-quality, or (vice-versa-like) use the o.c.r. >> version to bring the already-proofed text to final-stage, the >> effect is the same either way. you're comparing the two and >> implementing whatever changes are necessary to finalize. > > Any existing code around to do something like this ? Yes. I have created some code to do this, which I would be happy to share with you, but I'm hoping someone else has done it better. I'm currently checking out HTML Match (http://www.htmlmatch.com/) which claims that it is able to "ignore the source code and compare only the text content of the web pages." If you're interested, I'll report back on what I find. From traverso at posso.dm.unipi.it Tue Jul 29 21:33:13 2008 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Wed, 30 Jul 2008 06:33:13 +0200 (CEST) Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <488FB357.1080307@novomail.net> (message from Lee Passey on Tue, 29 Jul 2008 18:18:31 -0600) References: <488FB357.1080307@novomail.net> Message-ID: <20080730043313.CAEE91035B@posso.dm.unipi.it> >>>>> "Lee" == Lee Passey writes: Lee> John Vandenberg wrote: [snip] >>>> Interesting idea. I've been considering using o.c.r. to >>>> re-paginate a proofread text. It sounds like you're >>>> suggesting the opposite would be more fruitful. >>> well, whether you use the already-proofed text to bring the >>> o.c.r. version up to final-quality, or (vice-versa-like) use >>> the o.c.r. version to bring the already-proofed text to >>> final-stage, the effect is the same either way. you're >>> comparing the two and implementing whatever changes are >>> necessary to finalize. >> Any existing code around to do something like this ? Lee> Yes. Lee> I have created some code to do this, which I would be happy Lee> to share with you, but I'm hoping someone else has done it Lee> better. http//www.gnu.org/software/wdiff or emacs ediff in word mode are doing that excellently. Carlo From schultzk at uni-trier.de Wed Jul 30 02:10:33 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Wed, 30 Jul 2008 11:10:33 +0200 Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 027 In-Reply-To: References: Message-ID: Hi All, BB, Am 29.07.2008 um 18:06 schrieb Bowerbird at aol.com: > keith said: > > Have any of you tried thinking about parsing. > > i never did like that word -- "parsing". whenever i say that > what i am doing is "parsing", everything seems to fall apart. > but as long as i don't _call_ it that, it all works quite nicely... > > > > The way I see it you are basically using pattern matching. > > i don't know the distinctions in your terminology. Let see if I can get this into a nutshell and not get bashed! ;-)) Parsing does use pattern matching, but not the reverse. Pattern-matching basically just find something and does not analyze the strcuture of a text. Parsing will analyze the text and find its structure based on a so-called grammar. The grammar contains the rules for well-formedness. In its simplest for a parser will work only on what is well formed. A good parser will have rules for handling errors in the text and still give the structure. A parser can put the text into a particular structure based on the grammar. So what can the parser do for us. A parser will pull in the text and identify, the words sentences, quotes, chapter-headers, footenotes or whatever enties one defines. A parser can have context modes so that excepection handling and the identification of structures is aided. Using pattern matching you have to go through the pattern one after another. A parser will handle everthing in one pass if you wish by using look ahead and or look back. In a way you may say parsing is over-kill, but it has advantages because you can incorparate dictionary look in the process. I f you insist your entire preprocess is a multi-pass parser based on patterns. A parser is harder to develope because it is far more complex. > > > Using parsing you pull in the text at the same time > > you are catching all those errors. > > my tool _does_ "pull in the text" in order to "catch the errors". > i don't know how it would do it otherwise. > > > > An added feature is you have context information. > > well, i have two reactions to that. > > one, when using my tool, you have the entire book as "context", > with an entire page of text on-screen in front of you at all times, > and the rest of the book available to you, and to "find" operations. To me context information concerns the structure being analyzed. Not so much the co-text. > > > two, as i have tried to have people infer from all of my examples, > most of the time the _line_ is all the context that you really need. > indeed, that's one true beauty when using a line-based approach. > that's why 90% of what my tool does is based at the level of the line. Agreed, mostly. > > > > > Tools for doing this kind of work would be flex and bison. > > Any kind of flaging the text can be incorparated into the parser. > > i'm not familiar with those, but if you can repurpose them for use > in this type of analysis, the so-called open-source proponents here > would probably throw flowers your way... I do admit I am starting to get a itch. Have to think about it and see what rules and processing and interfaces they have. > > > > > On a side note, BB, your routine for names only works for english. > > But you are not interested in German anyway. > > well, because german capitalizes all nouns (or something like that), > you have more words that go into the hopper in the first place, yes, > but since most of those nouns will be in the dictionary, they will be > eliminated from the name-list, so that's no reason it wouldn't work. Just a pet tease ;-))) regards Keith. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080730/66a72cae/attachment-0001.htm From tb at baechler.net Wed Jul 30 02:53:11 2008 From: tb at baechler.net (Tony Baechler) Date: Wed, 30 Jul 2008 02:53:11 -0700 Subject: [gutvol-d] Logging hits to PG files Message-ID: <48903A07.3010307@baechler.net> Hello, Obviously, PG has no control over what their mirrors do, so this is mostly about gutenberg.org and readingroo.ms. My question is this: How is PG logging and using information regarding what books are downloaded and by whom? I know gutenberg.org keeps Apache logs because they've been mentioned here before. What I'm wondering is what PG uses this information for and how it's used. I've never seen any mention of readingroo.ms logs even though it is an official PG server and is owned by Greg Newby. What prompted this question, besides the usual concerns about privacy and security, is that I've recently been setting up and using TOR. While I like the general concept and I very much like anonymous browsing, it is very, very slow and is not good for file downloads. I'm not too worried on one hand whether PG knows what I download or not, but on the other hand, an official published statement from PG would be nice. I'm also thinking of people outside of the US who either may not legally use this material (but do anyway, obviously) or who may not read it because of the restrictions placed on them by their governments. In my case, PG wouldn't get much out of my downloads anyway because I download everything in English with a plain text edition, but I would still be happier knowing that PG isn't going to use, sell, track, or otherwise make use of information like my IP address, browser, etc. I do trust PG to a point, but the philosophy of TOR is to trust no one and I'm starting to see more and more how easy it is to track someone's browsing habbits. The only reason why I don't switch to TOR for almost everything is that it is very slow and is very short on relays. [TOR https://www.torproject.org/] is the link for more information on TOR. There are versions for Windows, Mac, Linux, etc. Thanks very much for providing clarification about this. If this is in the FAQ somewhere, sorry for not finding it, but other than on this list and Pre-prints, I've not seen mention of readingroo.ms before. While I'm here, a similar statement about worldebookfair.com would be helpful. I don't trust worldebookfair.com or PGCC because they aren't directly under the control of PG, Newby or Hart. From marcello at perathoner.de Wed Jul 30 04:32:38 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 30 Jul 2008 13:32:38 +0200 Subject: [gutvol-d] Logging hits to PG files In-Reply-To: <48903A07.3010307@baechler.net> References: <48903A07.3010307@baechler.net> Message-ID: <48905156.5090002@perathoner.de> Tony Baechler wrote: > Obviously, PG has no control over what their mirrors do, so this is > mostly about gutenberg.org and readingroo.ms. My question is this: How > is PG logging and using information regarding what books are downloaded > and by whom? gutenberg.org keeps Apache, Squid and ProFTP logs for one month in our private file area. I use them only for log analysis and for the top-100 list. A cron job deletes all logs older than one month. These files contain IP addresses, times of access and filenames. Only your provider can link the IP address back to you (if you are surfing from cable/DSL). Furthermore we keep the (Analog) statistics forever. These statistics contain IPs with more than 1.000 hits/day or 10.000 hits/month. They dont say which files those IPs accessed. I cannot say how long ibiblio keeps backups of our private files, nor if ibiblio keeps an own copy of the log files somewhere else and for how long. This is not an official statement. -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Wed Jul 30 09:14:19 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Jul 2008 12:14:19 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 027 Message-ID: keith said: > So what can the parser do for us. A parser will pull in the text and identify, > the words sentences, quotes, chapter-headers, footenotes or whatever enties i don't see what that buys us, in terms of the job at hand -- correcting errors. > Using pattern matching you have to go through the pattern one after another. well, i've explained a while back that this is the way we _want_ to do this. generally, a certain "pattern" will be treated similarly whenever it occurs, so it's fastest to treat each pattern in sequence, rather than mixing them. preprocessing is typically better-executed as a _book-wide_ methodology, rather than a _page-by-page_ task, so much that it's part of the definition... > A parser will handle everthing in one pass if you wish > by using look ahead and or look back. i can handle everything in one pass too, if i write the code that way. > To me context information concerns the structure being analyzed. > Not so much the co-text. except people don't need that to check the text against the image. you're overcomplicating the actual task at hand. it's a simple task. > Just a pet tease ;-))) oh, ok... ;+) -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080730/20475e12/attachment.htm From Bowerbird at aol.com Wed Jul 30 12:58:22 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Jul 2008 15:58:22 EDT Subject: [gutvol-d] woman in her own right -- 008 (and final) Message-ID: the proofers have finished "in her own right" in p1 and p2. the p1 proofers changed well over 1200 lines: > http://z-m-l.com/go/wihor/wihor-c-ocr-to-p1.html the p2 proofers changed about 300 lines: > http://z-m-l.com/go/wihor/wihor-c-p1-to-p2.html those are the kind of numbers you get from _crappy_preprocessing_. it's an _insult_ to put that kind of awful text in front of volunteers, yet this text was _reserved_ for _newcomers_ at distributed proofreaders! the only justification would be if this was some kind of _hazing_ritual_. since roger frank has shown he will _take_ things personally, even if i don't _make_ them personal in the first place, let's get a little personal: "c'mon roger frank, you can do _much_ better than you've done here... the proofers give you 100%; how about if you give them at least 50%?" roger was quick to tell us how many books that he has submitted to p.g. well, sure, it's easy when you put the work on the backs of the proofers. he gives them straw, and they spin it into gold and give it back to him... i would be embarrassed to ask people to help me with this kind of slop -- embarrassed or ashamed, one of the two, and most probably both -- if i ever did, which i wouldn't... as it is, this series is finished. i'm not gonna waste any more of my time analyzing "data" from an "experiment" as ill-conceived as this one was... i slap a big old "f" -- for "flunk!" -- on this paper and i'm sendin' it back. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080730/6761e2c7/attachment.htm From sly at victoria.tc.ca Wed Jul 30 13:44:17 2008 From: sly at victoria.tc.ca (Andrew Sly) Date: Wed, 30 Jul 2008 13:44:17 -0700 (PDT) Subject: [gutvol-d] woman in her own right -- 008 (and final) In-Reply-To: References: Message-ID: I usually just skim past or delete BB posts. But I must say this one actually made me angry. So, I'll try to contain that, and post a civil response. Just a quick reminder for any newcomers around, that our friend BB has shown over the last few years a habit of telling others what they should do, but has still not (that I am aware of) contributed anything of measurable substance towards PG. My experience is that once in a while he does give you an idea to make you think, but overall his inflammatory comments and apparent inability to work with others at all have resulted in his being banned from three different message areas that I know of. His previous ban on this mailing list was only temporary for reasons of wanting to remain fair and open, which Greg Newby described quite well at the time. Andrew On Wed, 30 Jul 2008 Bowerbird at aol.com wrote: > the proofers have finished "in her own right" in p1 and p2. From Bowerbird at aol.com Wed Jul 30 14:29:07 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Jul 2008 17:29:07 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 029 Message-ID: alright, now we're working on the final few clean-up phases... today we will do a collection of all of the mid-line-hyphenates, those words which contain a single-dash inside of themselves. 29. search for mid-line-hyphenates the list is appended... these mid-line-hyphenates will help the computer _resolve_ questionable end-line-hyphenates, if we wanted it to do that. but for now, we're just looking for o.c.r. misrecognitions... most of these words were correctly recognized; some were not. mistakes typically involve a speck misrecognized as an en-dash or -- conversely -- an em-dash misrecognized as an en-dash... the other class of error here involves end-line hyphenates that were rejoined by the proofer without a removal of the hyphen... looking through them, it's fairly easy to pick out the ones which will probably need corrections. those are the only ones i'll check. of the 41 i pulled out for examination, 25 were indeed incorrect... *** some inconsistencies usually appear involves hyphenates... this book was no exception. consider the following cases: 6 occurrences of "to-day", versus 2 occurrences of "today"... *** to-day (6) > from school to-day, and at least provided an emergency > sharper than usual to-day." Above the stained > and to-day," the doctor replied evasively, "you > informed him concisely, "to-day." > "Right away! now! to-day!" > Hollidew's in Greenstream to-day. I don't know *** today (2) > shoes for a lady today--a generous present for some > from Lettice; and, today, he had recognized a note 9 occurrences of "to-morrow", versus 4 for "tomorrow"... *** to-morrow (9) > "I can give you something day after to-morrow, > that the latter wanted, must have, to-morrow. But > "To-morrow, about seven. Everything will be > to the door; it said, "Gone fishing. Back to-morrow." > Greenstream, be ready to-morrow--" > to-morrow I will feel different." > over on the western mountain, to-morrow night, at > the obscurity of the maples to-morrow night ... > "The stage goes out from Greenstream to-morrow; *** tomorrow (4) > tomorrow, or when I go to church." > he would have his wages tomorrow; however, if > back with you tomorrow. He's been down to > would be dangerous tomorrow. given these inconsistencies, nobody should hesitate to convert all of the archaic forms to today's spelling. (i do that regardless.) *** 25 more lines corrected, for a grand total of 346, on 29 routines... i'll be back tomorrow with the next tip in this series... -bowerbird -------------------------------------------------------------- here is the list of __ mid-line-hyphenates... the 7 most weird cases are listed first, with the rest alphabetical. look through this list and mark the ones that look "weird" to you. then scroll down and see if it matches the list _i_ think are weird. (WENTY-SEVEN eye- willing- G-G-God brother-in-law father-in-law long-drawn-out a-calling a-talking air-tight all-over animal-like any-thing ash-pit bag-like bare-necked barely-furnished beady-eyed beat-of black-clad blood-guiltiness blood-money blue-black blue-green blue-white broken-a bull-dog business-like canvas-covered carelessly-garbed chocolate-colored claret-colored clay-cold clear-eyed close-cropped close-cut close-haired close-lipped co-operation cold-blooded conscience-stricken cross-grained deep-shaded deeply-bitten deeply-grassed deeply-lined deeply-scrolled dining-room eighty-nine ever-trimmed ex-stage far-reaching fool-hearted fore-knowledge four-legged four-square freshly-colored freshly-flushed full-lidded gaily-attired gaily-patched gas-eously gaunt-jawed glanced,-each green-sickness greenish-black greenish-gold grey-green half-absently half-buried half-calculated half-closed half-distant half-grown half-heard half-heartedly half-hid half-mechanically half-way hammer-like heart-breaking heavily-built heavy-sweet high-power high-rolling highly-colored highly-simplified hollow-sounding home-knitted hoof-beats ill-concealed ill-considered ill-defined ill-directed ill-proportioned ill-will in-those inwardly-gratifying iron-like JacK-son know;-but leaden-faced leaden-grey leaden-hued leather-like left-hand life-long life-time lightly-struck long-accumulated long-drawn long-familiar long-lost loose-jointed loose-living low-drifting Mac-Kimmon Mai-son machine-like mahogany-colored mid-August milk-white moon-blanched mud-coated naked-seeming nephews-a newly-augmented newly-awakened newly-minted nice-hearted night-like nine-tenths off-hand old-time olive-colored one-time open-handed out-flung paper-shavers pasty-white pellu-cidly plum-colored plush-lined post-office prayer-meeting pride-fully public-spirited raw-boned re-departed re-entered red-clad red-headed robustly-witted rock-like rough-coated roughly-cleared Sim-mons' salt-raised sap-boil sap-boiling sap-boilings school-girl school-teacher self-approval self-assertion self-consciousness self-esteem self-headed self-sufficient semi-obscurity semi-ruin semi-surreptitiously seventy-five sharp-like sharp-witted sharply-cut shed-like sheep-cots silver-plated sing-song skull-like sleep-walker slightly-built slightly-varying slow-Idndling slowly-formulating so-so softly-swelling stage-driving stiffly-extended stone-bound store-keepers' straw-colored sulphur-yellow sun-heated swiftly-falling tar-paper thirty-eight thirty-one thousand-fold thread-bare throat.-"There tightly-folded to-day to-morrow to-night tobacco-stained toil-hardened too-a toy-like twenty-eight twenty-five twenty-seven two-seated ultra-blue under-garment under-turning undi-vined unformu-lated unhap-piness up-flung up-rolled vain-longing variously-colored violent-handed weather-beaten weather-proof weather-worn well-fed well-known white-hot white-powdered wild-eyed wind-herded wire-hound wooden-soled world-old yellow-red yellowish-white scroll down for the cases that i considered to be the "problem" cases... scroll down for the cases that i considered to be the "problem" cases... scroll down for the cases that i considered to be the "problem" cases... *** the problem cases (the ones marked with asterisks were errors) (WENTY-SEVEN * eye- * willing- * G-G-God brother-in-law father-in-law long-drawn-out all-over any-thing * beat-of * blood-guiltiness broken-a * co-operation ex-stage gas-eously * glanced,-each * green-sickness high-power ill-will in-those * JacK-son * know;-but * life-time * Mac-Kimmon * Mai-son * naked-seeming nephews-a * night-like pellu-cidly * pride-fully * Sim-mons' * slow-Idndling * store-keepers' * throat.-"There * too-a * under-turning undi-vined * unformu-lated * unhap-piness * vain-longing world-old ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080730/c94b2887/attachment.htm From Bowerbird at aol.com Wed Jul 30 14:53:25 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Jul 2008 17:53:25 EDT Subject: [gutvol-d] more revealing data as "mountain blood" exits p2 Message-ID: "mountain blood" was a test of parallel p1 over at d.p., by rfrank... after two parallel proofings in p1, their outputs were merged for p2. p2 did 50 edits, making 61 lines change. they're appended, and here: > http://z-m-l.com/go/mount/mount-c-p2results.html 3/4 of these edits -- 37 -- were "bureaucratic", necessary only due to stupid policies on ellipses, end-line-hyphenates, em-dashes, etc. basically, p2 just piddled through the book making needless changes. another _7_ errors were _mistakes_in_merging_. that is, _one_ of the parallel proofings got the line correct, but the person merging them chose the line from the _incorrect_ proofing instead of the correct one. p2 also caught 3 p-book errors that neither parallel proofings found. as it's not within the purview of proofers to catch these p-book errors, i don't count this against the p1 proofings. but i've mentioned it here to acknowledge this positive result from p2's word-by-word checking. finally, p2 did catch _2_ o.c.r. errors missed by both p1 proofings... offsetting that ever-so-slightly, p2 _introduced_ 1 error of its own. *** so, let me review. the parallel p1 proofings created a composite text with _2_errors_. that's pretty fantastic. ironically, the person who _merged_ those two parallel p1 proofings made _7_ errors during the merge process, thereby _dwarfing_ the 2 errors that were actually made. restating that, the person who did the merge made _3_times_ _more_mistakes_ in the merge process than the proofers made. *** i will leave it up to you to decide if all these proofings were "worth it". at 2 minutes average per page, every round took about _12_hours_... i _will_ remind you that aggressive preprocessing of this text corrected all but _3_ of the o.c.r. errors, and both p1 proofings caught all 3 of 'em. for the record, here are those 3 errors: > If they took away the chair, Gordon knew, he wag > If they took away the chair, Gordon knew, he was > "Why, damn it fell, Gord!" exclaimed an individual, > "Why, damn it t'ell, Gord!" exclaimed an individual, > la wed bank; a rate of interest a man can carry without > lawed bank; a rate of interest a man can carry without heck, even though i call that "aggressive preprocessing", the fact is that i'm documenting the routines to do it, and all of them are quite obvious. i've now described 29 routines, and none of them were esoteric at all... yet _none_ of these 29 routines were run against this "mountain" text. not a one. why not? i think that's absolutely _terrible_. it's a disgrace. had good preprocessing been done, all but 3 pages would've "no diffed" in both p1 proofings, and been perfect after either one, and these results would've clearly reflected the excellent state of this text at the end of this. as it was, this text received only a handful of "no diff" pages in either p1, and then in p2 another 44 of those pages recorded _yet_another_ "diff"... it is now waiting for p3, and how would anyone know it's almost perfect? once again, as we've seen time after time with all of these "experiments", the proofers _rock_, while the d.p. bureaucracy is _staggeringly_stupid_. but it will now be clear, to you and to rfrank, that parallel proofing "works". -bowerbird ********************* results from the "mountain blood" test-book... ********************* the 50 "errors" found by p2 after the p1 merge... ********************* bureaucracy ellipsis (20) ********************* bureaucracy end-line-hyphenate (12) ********************* bureaucracy em-dash (2) ********************* bureaucracy 8-bit (2) ********************* bureaucracy begin-chapter-caps (1) ********************* merge error (7) ********************* p-book error (3) ********************* real error that was caught (2) ********************* incorrectly changed by p2 (1) 1> -- p1merged 2> -- p2 proofing #> http://z-m-l.com/go/mount/mountp003.html 1 ********************* bureaucracy 8-bit 1> ALFRED A KNOPF 2> ALFRED ? A ? KNOPF #> http://z-m-l.com/go/mount/mountp009.html 2 ********************* bureaucracy begin-chapter-caps 1> The fiery disk of the sun was just lifting above 2> THE fiery disk of the sun was just lifting above #> http://z-m-l.com/go/mount/mountp023.html 3 ********************* merge error 1> V. 2> V #> http://z-m-l.com/go/mount/mountp031.html 4 ********************* bureaucracy end-line-hyphenate 1> hidden space, the village lay along its white highway. 2> hidden space, the village lay along its white high-*way. #> http://z-m-l.com/go/mount/mountp034.html 5 ********************* bureaucracy end-line-hyphenate 1> opposite side the mellow brick face of the Court-*house 2> opposite side the mellow brick face of the Courthouse #> http://z-m-l.com/go/mount/mountp041.html 6 ********************* bureaucracy ellipsis 1> "our link with the outer world, our faithful messenger....I 2> "our link with the outer world, our faithful messenger.... ********************* bureaucracy ellipsis (2nd line) 1> wanted to see you; ah, yes." He 2> I wanted to see you; ah, yes." He #> http://z-m-l.com/go/mount/mountp048.html 7 ********************* bureaucracy end-line-hyphenate 1> the lush grass, the greenish-gold sparks of the fireflies 2> the lush grass, the greenish-gold sparks of the fire-*flies #> http://z-m-l.com/go/mount/mountp052.html 8 ********************* bureaucracy ellipsis 1> the secretiveness, of night ... Greenstream village 2> the secretiveness, of night.... Greenstream village #> http://z-m-l.com/go/mount/mountp058.html 9 ********************* bureaucracy end-line-hyphenate 1> "Give me the man from the woods for an open-handed 2> "Give me the man from the woods for an openhanded #> http://z-m-l.com/go/mount/mountp087.html 10 ********************* bureaucracy ellipsis 1> injustice and injury. He hardened, grew defiant ... the 2> injustice and injury. He hardened, grew defiant ********************* bureaucracy ellipsis (2nd line) 1> strain of lawlessness brought so many years 2> ... the strain of lawlessness brought so many years #> http://z-m-l.com/go/mount/mountp098.html 11 ********************* bureaucracy 8-bit 1> They stood before the dark, porchless facade 2> They stood before the dark, porchless fa?ade #> http://z-m-l.com/go/mount/mountp100.html 12 ********************* bureaucracy ellipsis 1> for frugality, for independence, as a reserve ... 2> for frugality, for independence, as a reserve ********************* bureaucracy ellipsis (2nd line) 1> or for pleasure. It was the hottest hour of the 2> ... or for pleasure. It was the hottest hour of the ********************* bureaucracy ellipsis 1> in that banal setting, suddenly grew unbearable. 2> in that banal setting, suddenly grew unbearable.... #> http://z-m-l.com/go/mount/mountp111.html 14 ********************* bureaucracy end-line-hyphenate 1> would." He turned with a sigh to the log. A crosscut 2> would." He turned with a sigh to the log. A cross-*cut #> http://z-m-l.com/go/mount/mountp125.html 15 ********************* bureaucracy end-line-hyphenate 1> the dark house.... He shut his eyes for a mo* 2> the dark house.... He shut his eyes for a mo-* #> http://z-m-l.com/go/mount/mountp128.html 16 ********************* bureaucracy em-dash 1> stirring--three souls redeemed from everlasting torment 2> stirring---three souls redeemed from everlasting torment #> http://z-m-l.com/go/mount/mountp131.html 17 ********************* incorrectly changed by p2 1> say, Augustus," he demanded in eager, tremulous 2> say, Augustus,"he demanded in eager, tremulous #> http://z-m-l.com/go/mount/mountp134.html 18 ********************* bureaucracy ellipsis 1> "Teacher, kin I be excused? Teacher! ... Teacher--!'" 2> "'Teacher, kin I be excused? Teacher! ... Teacher--!'" #> http://z-m-l.com/go/mount/mountp151.html 19 ********************* merge error 1> in silkaleen and back in Al mohair, it'll stand you 2> in silkaleen and back in A1 mohair, it'll stand you #> http://z-m-l.com/go/mount/mountp153.html 20 ********************* bureaucracy ellipsis 1> charming little wife, large fortune at your disposal. 2> charming little wife, large fortune at your disposal.... ********************* bureaucracy ellipsis (2nd line) 1> ... Pompey left one of the solidest estates in this 2> Pompey left one of the solidest estates in this #> http://z-m-l.com/go/mount/mountp154.html 21 ********************* bureaucracy ellipsis 1> in the far future, perhaps in another generation. 2> in the far future, perhaps in another generation.... ********************* bureaucracy ellipsis (2nd line) 1> ... What would you say to a flat eight dollars an 2> What would you say to a flat eight dollars an #> http://z-m-l.com/go/mount/mountp174.html 22 ********************* p-book error 1> to the rod. General Jackson's head hung panting, 2> to the road. General Jackson's head hung panting, #> http://z-m-l.com/go/mount/mountp176.html 23 ********************* bureaucracy ellipsis 1> "Yes," she assented, "there was nothing else open.... Won't 2> "Yes," she assented, "there was nothing else open.... ********************* bureaucracy ellipsis (2nd line) 1> you come up and smoke a cigarette? 2> Won't you come up and smoke a cigarette? #> http://z-m-l.com/go/mount/mountp196.html 24 ********************* bureaucracy ellipsis 1> longer. You can't tread on me. It's going to stop 2> longer. You can't tread on me. It's going to stop ... now." ********************* bureaucracy ellipsis (2nd line) 1> ... now." 2> ??synchlineadded #> http://z-m-l.com/go/mount/mountp213.html 25 ********************* merge error 1> the astute storekeeper into such a satisfactory, retail-* 2> the astute storekeeper into such a satisfactory, retali-* #> http://z-m-l.com/go/mount/mountp217.html 26 ********************* merge error 1> of Lattice's." 2> of Lettice's." #> http://z-m-l.com/go/mount/mountp219.html 27 ********************* p-book error 1> hundred per cent. increase." 2> hundred per cent increase." #> http://z-m-l.com/go/mount/mountp246.html 28 ********************* bureaucracy end-line-hyphenate 1> the best clothes. I'll tell him I'm a poor schoolteacher 2> the best clothes. I'll tell him I'm a poor school-teacher #> http://z-m-l.com/go/mount/mountp256.html 29 ********************* bureaucracy ellipsis 1> had gone to the sap-boiling.... I sat up all night 2> had gone to the sap-boiling ... I sat up all night ********************* bureaucracy ellipsis 1> ... waiting.... I couldn't wait any longer, Gordon, 2> ... waiting ... I couldn't wait any longer, Gordon, #> http://z-m-l.com/go/mount/mountp281.html 31 ********************* merge error 1> ??line-was-missing?? 2> On an afternoon of the second autumn following #> http://z-m-l.com/go/mount/mountp286.html 32 ********************* bureaucracy ellipsis 1> now...." her blue gaze blurred with slow tears. 2> now ..." her blue gaze blurred with slow tears. #> http://z-m-l.com/go/mount/mountp308.html 33 ********************* bureaucracy end-line-hyphenate 1> necessary, and knock the bottom out of the store-keepers' 2> necessary, and knock the bottom out of the storekeepers' #> http://z-m-l.com/go/mount/mountp310.html 34 ********************* bureaucracy em-dash 1> make her, but it would certainly be accommodating--" 2> make her, but it would certainly be accommodating--" he ********************* bureaucracy em-dash (2nd line) 1> he paused interrogatively. 2> paused interrogatively. ********************* bureaucracy ellipsis 1> for the note, if it comes to that. But the fact is 2> for the note, if it comes to that. But the fact is ... I've ********************* bureaucracy ellipsis (2nd line) 1> ... I've got a lot of money laid out. What's been 2> got a lot of money laid out. What's been #> http://z-m-l.com/go/mount/mountp321.html 36 ********************* bureaucracy end-line-hyphenate 1> you are the son of your father. I knew your grandfather 2> you are the son of your father. I knew your grand-*father ********************* real error that was caught 1> "We've never been storekeepers," 2> "We've never been storekeepers." #> http://z-m-l.com/go/mount/mountp323.html 38 ********************* bureaucracy ellipsis 1> "I don't want to make! I don't want to take anything 2> "I don't want to make! I don't want to take anything ... never ********************* bureaucracy ellipsis (2nd line) 1> ... never again! I want--" 2> again! I want--" #> http://z-m-l.com/go/mount/mountp324.html 39 ********************* bureaucracy end-line-hyphenate 1> met the Company's agents, heard the agreement outlined; 2> met the Company's agents, heard the agreement out-*lined; #> http://z-m-l.com/go/mount/mountp331.html 40 ********************* merge error 1> but not Kenny's for nineteen years." Another bore, 2> but not Henny's for nineteen years." Another bore, #> http://z-m-l.com/go/mount/mountp337.html 41 ********************* bureaucracy ellipsis 1> read ... Why! ... Why, damn it! they had it 2> read ... Why!... Why, damn it! they had it #> http://z-m-l.com/go/mount/mountp338.html 42 ********************* bureaucracy ellipsis 1> the motives of such ill-considered...." 2> the motives of such ill-considered ..." #> http://z-m-l.com/go/mount/mountp343.html 43 ********************* bureaucracy ellipsis 1> I only stopped now to warn you away.... I'll hitch 2> I only stopped now to warn you away ... I'll hitch ********************* real error that was caught 1> of wrath, his arm rose, with a finger indicating the* 2> of wrath, his arm rose, with a finger indicating the #> http://z-m-l.com/go/mount/mountp346.html 45 ********************* bureaucracy end-line-hyphenate 1> Why, you've been the mutton of every little storekeeper 2> Why, you've been the mutton of every little store-*keeper #> http://z-m-l.com/go/mount/mountp362.html 46 ********************* bureaucracy ellipsis 1> done a thing like that. Why, just see...." Gordon 2> done a thing like that. Why, just see ..." Gordon #> http://z-m-l.com/go/mount/mountp365.html 47 ********************* bureaucracy end-line-hyphenate 1> *ing, dead planet. Gleams of light shot like quicksilver 2> *ing, dead planet. Gleams of light shot like quick-*silver #> http://z-m-l.com/go/mount/mountp367.html 48 ********************* p-book error 1> in his brain, was dulled. He place his foot upon the 2> in his brain, was dulled. He placed his foot upon the ********************* bureaucracy ellipsis 1> "If it hadn't been for you, what you did for me 2> "If it hadn't been for you, what you did for me ... others ... new ********************* bureaucracy ellipsis (2nd line) 1> ... others ... new courage, example of bigness--Why! 2> courage, example of bigness--Why! #> http://z-m-l.com/go/mount/mountp368.html 50 ********************* merge error 1> him to where, on. the bureau, a lamp had been left. 2> him to where, on the bureau, a lamp had been left. ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080730/7ad28603/attachment-0001.htm From Bowerbird at aol.com Wed Jul 30 15:31:52 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Jul 2008 18:31:52 EDT Subject: [gutvol-d] planet strappers -- iteration #10 Message-ID: "planet strappers" -- the _perpetual_ p1 experiment over at d.p. -- has now finished up with it's 10th iteration of the proofing process. as reported earlier, this iteration also missed the error on page 33... better luck next iteration... that's not to say that this iteration didn't make lots of changes though! 140 lines were changed during this iteration! it's very hard to believe that so many o.c.r. errors survived through so many previous iterations. and indeed, your instincts are correct. these 140 changes were almost _entirely_ on "bureaucratic" changes, on ellipses, em-dashes, and so on. some of them were because the newbies who were doing this proofing simply don't understand the rules, and are applying them erroneously... but a good number of the changes are because the rules are _slippery_, and -- in order to exert a sense of agency -- proofers take advantage... for instance, how would you handle this end-of-line-hyphenate? > than the tonnage of the Port of Baltimore, to- > day." (that's an actual example, taken from "in her own right".) the d.p. rules state that a proofer should eliminate an unneeded hyphen. if the word is "today", then you would join it together as "today". great. but if the word is "to-day", then you would join it together as "to-day". and if you're not sure, the d.p. convention is to mark it as "to-*day". ok. but remember how i just showed you that some old books are inconsistent and they can use "today" and "to-day" in the same book? so a proofer who saw a case of "today" used on another page would eliminate the hyphen... another proofer who saw a case of "to-day" on another page would keep it. and a proofer who saw neither (or both) might mark the word as "to-*day"... so what you get is a proofer marking it one way in one iteration, and then another proofer marking it another way in another iteration, and then yet another proofing marking it the third way in another iteration. and then we go back to the first kind of proofer, who changes it in another iteration. it's rock-paper-scissors, and no one is ever "right", and it can go on forever. this is the kind of indeterminacy that poorly-thought policies buys you... *** there were a few cases (3) where a non-bureaucratic change was made. in 2 of these, iteration#10 was correcting an error made by iteration#9, but earlier rounds had had the line correct, so this doesn't really "count". in the third case, iteration#10 introduced a new error. #9 had it right. how's that? for every 2 errors you fix, you introduce 1 new one... talk about a perfect-glove fit for "two steps forward, one step back". *** to its credit, though, iteration#10 found _3_ more p-book errors. these really don't matter much, but it's amazing that nobody had noticed these errors before. (at least i don't _recall_ that they had; it's just not important enough to me that i go back and confirm it.) anyway, here are those 3 cases: > http://z-m-l.com/go/plans/plansp060.html > the mountain wall, imbedded in the dust of the mare. There > the mountain wall, embedded in the dust of the mare. There > http://z-m-l.com/go/plans/plansp139.html > theirs. I'll cover the rest of this batch: You'll be better than I > theirs. I'll cover the rest of this batch. You'll be better than I > http://z-m-l.com/go/plans/plansp141.html > first task that Nelsen had ever performed in space--the jockying > first task that Nelsen had ever performed in space--the jockeying again, no big deal, but surprising that, 10 rounds in, we still find stuff. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080730/b189451b/attachment.htm From jayvdb at gmail.com Wed Jul 30 17:22:18 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Thu, 31 Jul 2008 10:22:18 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <20080730043313.CAEE91035B@posso.dm.unipi.it> References: <488FB357.1080307@novomail.net> <20080730043313.CAEE91035B@posso.dm.unipi.it> Message-ID: On Wed, Jul 30, 2008 at 2:33 PM, Carlo Traverso wrote: >>>>>> "Lee" == Lee Passey writes: > > Lee> John Vandenberg wrote: [snip] > > >>>> Interesting idea. I've been considering using o.c.r. to > >>>> re-paginate a proofread text. It sounds like you're > >>>> suggesting the opposite would be more fruitful. > >>> well, whether you use the already-proofed text to bring the > >>> o.c.r. version up to final-quality, or (vice-versa-like) use > >>> the o.c.r. version to bring the already-proofed text to > >>> final-stage, the effect is the same either way. you're > >>> comparing the two and implementing whatever changes are > >>> necessary to finalize. > >> Any existing code around to do something like this ? > > Lee> Yes. > > Lee> I have created some code to do this, which I would be happy > Lee> to share with you, but I'm hoping someone else has done it > Lee> better. > > http//www.gnu.org/software/wdiff or emacs ediff in word mode are > doing that excellently. I gave wdiff a whirl yesterday, first comparing the two online editions of SBE v42 Book 2, which worked like a treat, and then attempting to merge the corrected text into the OCR text. Pagescan: http://en.wikisource.org/wiki/Page:Sacred_Books_of_the_East_42.djvu/130 ----Raw OCR---- 3. Thou, (O Agni), rulest over all the animals of the earth, those which have been born, and those which are to be born : may not in-breathing leave this one, nor yet out-breathing, may neither friends nor foes slay him ! 4. May father Dyaus (sky) and mother Pr/thivi (earth), co-operating, grant thee death from old age, that thou mayest live in the lap of Aditi a hundred winters, guarded by in-breathing and out- breathing ! 5. Lead this dear child to life and vigour, O Agni, ----Clean text---- 3. Thou, (O Agni), rulest over all the animals of the earth, those which have been born, and those which are to be born: may not in-breathing leave this one, nor yet out-breathing, may neither friends nor foes slay him! 4. May father Dyaus (sky) and mother Prithivi (earth), co-operating, grant thee death from old age, that thou mayest live in the lap of Aditi a hundred winters, guarded by in-breathing and outbreathing! 5. Lead this dear child to life and vigour, O Agni, Varuna, and king Mitra! As a mother afford him protection, O Aditi, and all ye gods, that he may attain to old age! ----wdiff output---- [-3.-] {+3.+} Thou, (O Agni), rulest over all the animals of the earth, those which have been born, and those which are to be [-born :-] {+born:+} may not in-breathing leave this one, nor yet out-breathing, may neither friends nor foes slay [-him ! 4.-] {+him! 4.+} May father Dyaus (sky) and mother [-Pr/thivi-] {+Prithivi+} (earth), co-operating, grant thee death from old age, that thou mayest live in the lap of Aditi a hundred winters, guarded by in-breathing and [-out- breathing ! 5.-] {+outbreathing! 5.+} Lead this dear child to life and vigour, O Agni, [--] {+Varuna, and king Mitra! As a mother afford him protection, O Aditi, and all ye gods, that he may attain to old age!+} ---- end ---- In the above wdiff ouput, I've lost my end-of-lines that were in the original OCR text. I've looked at the wdiff options, and cant see which would do the trick. -- John Vandenberg From jayvdb at gmail.com Wed Jul 30 18:02:38 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Thu, 31 Jul 2008 11:02:38 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: <488FB357.1080307@novomail.net> <20080730043313.CAEE91035B@posso.dm.unipi.it> Message-ID: On Thu, Jul 31, 2008 at 10:22 AM, John Vandenberg wrote: > On Wed, Jul 30, 2008 at 2:33 PM, Carlo Traverso > wrote: >>>>>>> "Lee" == Lee Passey writes: >> >> Lee> John Vandenberg wrote: [snip] >> >> >>>> Interesting idea. I've been considering using o.c.r. to >> >>>> re-paginate a proofread text. It sounds like you're >> >>>> suggesting the opposite would be more fruitful. >> >>> well, whether you use the already-proofed text to bring the >> >>> o.c.r. version up to final-quality, or (vice-versa-like) use >> >>> the o.c.r. version to bring the already-proofed text to >> >>> final-stage, the effect is the same either way. you're >> >>> comparing the two and implementing whatever changes are >> >>> necessary to finalize. >> >> Any existing code around to do something like this ? >> >> Lee> Yes. >> >> Lee> I have created some code to do this, which I would be happy >> Lee> to share with you, but I'm hoping someone else has done it >> Lee> better. >> >> http//www.gnu.org/software/wdiff or emacs ediff in word mode are >> doing that excellently. > > I gave wdiff a whirl yesterday, first comparing the two online > editions of SBE v42 Book 2, which worked like a treat, and then > attempting to merge the corrected text into the OCR text. > > Pagescan: > > http://en.wikisource.org/wiki/Page:Sacred_Books_of_the_East_42.djvu/130 > > ----Raw OCR---- > > 3. Thou, (O Agni), rulest over all the animals of > the earth, those which have been born, and those > which are to be born : may not in-breathing leave > this one, nor yet out-breathing, may neither friends > nor foes slay him ! > 4. May father Dyaus (sky) and mother Pr/thivi > (earth), co-operating, grant thee death from old > age, that thou mayest live in the lap of Aditi a > hundred winters, guarded by in-breathing and out- > breathing ! > 5. Lead this dear child to life and vigour, O Agni, > > > ----Clean text---- > > 3. > Thou, (O Agni), rulest over all the animals of the earth, those which have been > born, and those which are to be born: may not in-breathing leave this one, nor > yet out-breathing, may neither friends nor foes slay him! > > > > 4. May father Dyaus > (sky) and mother Prithivi (earth), co-operating, grant thee death from old age, > that thou mayest live in the lap of Aditi a hundred winters, guarded by > in-breathing and outbreathing! > > > > 5. Lead this dear child to life and vigour, O > Agni, Varuna, and king Mitra! As a mother afford him protection, O Aditi, and > all ye gods, that he may attain to old age! > > > ----wdiff output---- > > > [-3.-] > {+3.+} > Thou, (O Agni), rulest over all the animals of the earth, those which have been > born, and those which are to be [-born :-] {+born:+} may not > in-breathing leave this one, nor > yet out-breathing, may neither friends nor foes slay [-him ! > 4.-] {+him! > > > > 4.+} May father Dyaus > (sky) and mother [-Pr/thivi-] {+Prithivi+} (earth), co-operating, > grant thee death from old age, > that thou mayest live in the lap of Aditi a hundred winters, guarded by > in-breathing and [-out- > breathing ! > 5.-] {+outbreathing! > > > > 5.+} Lead this dear child to life and vigour, O > Agni, > [--] {+Varuna, and king Mitra! As a mother afford him protection, O Aditi, and > all ye gods, that he may attain to old age!+} > > ---- end ---- > > In the above wdiff ouput, I've lost my end-of-lines that were in the > original OCR text. I've looked at the wdiff options, and cant see > which would do the trick. The solution came to me: call it as "wdiff clean-text ocr-text", resulting in: ---- [-3.-]{+3.+} Thou, (O Agni), rulest over all the animals of the earth, those which have been born, and those which are to be [-born:-] {+born :+} may not in-breathing leave this one, nor yet out-breathing, may neither friends nor foes slay [-him! 4.-] {+him ! 4.+} May father Dyaus (sky) and mother [-Prithivi-] {+Pr/thivi+} (earth), co-operating, grant thee death from old age, that thou mayest live in the lap of Aditi a hundred winters, guarded by in-breathing and [-outbreathing! 5.-] {+out- breathing ! 5.+} Lead this dear child to life and vigour, O Agni, [-Varuna, and king Mitra! As a mother afford him protection, O Aditi, and all ye gods, that he may attain to old age!-] {++} --- Is there a GUI or command line front-end for wdiff, to allow interactive accept/reject of each change? -- John Vandenberg From schultzk at uni-trier.de Thu Jul 31 00:13:54 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Thu, 31 Jul 2008 09:13:54 +0200 Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 027 In-Reply-To: References: Message-ID: Am 30.07.2008 um 18:14 schrieb Bowerbird at aol.com: > keith said: > > So what can the parser do for us. A parser will pull in the > text and identify, > > the words sentences, quotes, chapter-headers, footenotes or > whatever enties > > i don't see what that buys us, in terms of the job at hand -- > correcting errors. Well, what I said about would belong more in helping mark up the text. YET, identifing words, sentences, and quotes and making sure they are balanced can help in finding the errors. > > > > > Using pattern matching you have to go through the pattern one > after another. > > well, i've explained a while back that this is the way we _want_ to > do this. > > generally, a certain "pattern" will be treated similarly whenever > it occurs, > so it's fastest to treat each pattern in sequence, rather than > mixing them. Well, parsing will not mix them up and if the parser is written as to tag the error (or correct it) as being such an such error. > > > preprocessing is typically better-executed as a _book-wide_ > methodology, > rather than a _page-by-page_ task, so much that it's part of the > definition... So what its the problem. A parser could care less. > > > > > A parser will handle everthing in one pass if you wish > > by using look ahead and or look back. > > i can handle everything in one pass too, if i write the code that way. > > > > To me context information concerns the structure being analyzed. > > Not so much the co-text. > > except people don't need that to check the text against the image. > you're overcomplicating the actual task at hand. it's a simple task. I never talked about using the image! regards Keith. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080731/0a1dc32e/attachment.htm From jayvdb at gmail.com Thu Jul 31 01:18:50 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Thu, 31 Jul 2008 18:18:50 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <20080730043313.CAEE91035B@posso.dm.unipi.it> References: <488FB357.1080307@novomail.net> <20080730043313.CAEE91035B@posso.dm.unipi.it> Message-ID: On Wed, Jul 30, 2008 at 2:33 PM, Carlo Traverso wrote: >>>>>> "Lee" == Lee Passey writes: > > Lee> John Vandenberg wrote: [snip] > > >>>> Interesting idea. I've been considering using o.c.r. to > >>>> re-paginate a proofread text. It sounds like you're > >>>> suggesting the opposite would be more fruitful. > >>> well, whether you use the already-proofed text to bring the > >>> o.c.r. version up to final-quality, or (vice-versa-like) use > >>> the o.c.r. version to bring the already-proofed text to > >>> final-stage, the effect is the same either way. you're > >>> comparing the two and implementing whatever changes are > >>> necessary to finalize. > >> Any existing code around to do something like this ? > > Lee> Yes. > > Lee> I have created some code to do this, which I would be happy > Lee> to share with you, but I'm hoping someone else has done it > Lee> better. > > http//www.gnu.org/software/wdiff or emacs ediff in word mode are > doing that excellently. In my investigation, I have found another simple program called dwdiff, which is mostly commandline compatible with wdiff. Only the --autopager, --terminal and --avoid-wraps options are not supported. Here is the full set: -C, --copyright print Copyright then exit -V, --version print program version then exit -1, --no-deleted inhibit output of deleted words -2, --no-inserted inhibit output of inserted words -3, --no-common inhibit output of common words -a, --auto-pager automatically calls a pager -h, --help print this help -i, --ignore-case fold character case while comparing -l, --less-mode variation of printer mode for "less" -n, --avoid-wraps do not extend fields through newlines -p, --printer overstrike as for printers -s, --statistics say how many words deleted, inserted etc. -t, --terminal use termcap as for terminal displays -w, --start-delete=STRING string to mark beginning of delete region -x, --end-delete=STRING string to mark end of delete region -y, --start-insert=STRING string to mark beginning of insert region -z, --end-insert=STRING string to mark end of insert region http://os.ghalkes.nl/dwdiff.html http://www.linux.com/articles/114176 In my wandering of the web, I found the attached file mentioned here http://mail.python.org/pipermail/tutor/2002-April/013928.html ... hidden away in the archive ... http://web.archive.org/web/20020313231458/mike-labs.com/wd2h/wd2h.html wd2h.pl shows the rough algorithm that wdiff is performing. -- John Vandenberg -------------- next part -------------- A non-text attachment was scrubbed... Name: wd2h.pl Type: application/octet-stream Size: 10431 bytes Desc: not available Url : http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080731/6e2ac5b8/attachment-0001.obj From rfrank at pobox.com Thu Jul 31 06:43:51 2008 From: rfrank at pobox.com (Roger Frank) Date: Thu, 31 Jul 2008 09:43:51 -0400 Subject: [gutvol-d] woman in her own right -- 008 (and final) In-Reply-To: References: Message-ID: On Wed, 30 Jul 2008 13:44:17 -0700 (PDT), Andrew Sly wrote: > > I usually just skim past or delete BB posts. > But I must say this one actually made me angry. After Andrew's post, I couldn't resist digging the referenced post out of the trash to see what it said. ("the proofers have finished 'in her own right' in p1 and p2.") > as it is, this series is finished. i'm not gonna waste any more of my > time analyzing "data" from an "experiment" as ill-conceived as this > one was... "i'm not gonna waste and more of my time...." means to me that BB felt the analysis he had already done was a waste of time. I wonder how many would agree. Since the many of BB's posts that I saw before I put the kill filter on were all about me and my work, I wonder what crusade BB will start next? So what else is in here? Will it be amateurish, contrived, selective, inflammatory, ad hominem, or a combination of these? Let's see.... > the p1 proofers changed well over 1200 lines What does that mean? Of course it sounds bad, because it is meant to sound bad. Well, the page headers were left in for this book (and noted in the project thread) to see if forcing attention to the top of the page would make the proofers more accurate regarding top-of-page paragraph dileneation. So out of the 1200 lines, about half of them were adjustments to the top of the page markup on most of the 337 pages. Of those that remain, it averages to just about two lines with corrections per page. I'm comfortable with that. > since roger frank has shown he will _take_ things personally, even if > i don't _make_ them personal in the first place, let's get a little > personal.... An interesting, if specious, justification for a personal attack. > roger was quick to tell us how many books that he has submitted to > p.g. well, sure, it's easy when you put the work on the backs of the > proofers. he gives them straw, and they spin it into gold and give it > back to him... Absolutely the proofers, formatters, smoothies and the posting team all play a big part in any book's journey from scanner to posting. I've made that comment many times. But the quoted passage misses the point entirely (as usual, as it's meant to do, for effect.) The reason I mentioned that I had posted several books to PG was not to compare or minimize anyone who hasn't but to point out that with postprocessing-time comes experience. That experience can only help a contributor who is working to make the DP/PG process better. I think back to when I became a high-school teacher. I was working at my engineering job in the day and teaching computer systems engineering at a local college at night. I decided that I liked teaching better and decided to become a high-school math teacher. I went to night school for a few years to get my teaching license and a M.A. in Education. I knew the math because of my engineering career. I knew about teaching from my studies, or thought I did. But it was all put in perspective by a wise, experienced teacher who told me right at the start of my first math teaching assignment: "You won't know this material and you won't know how to teach it until you've taught if for three years." He was absolutely right. I did mention that I had posted several books to DP and that I felt the experience made me a better contributor, but it may have been better stated that I have been logging in and working at DP every day for over three years and with that has come an understanding that I don't believe anyone can get without that experience. I'm not saying anyone without experience should not speak up. Anyone can have a good idea, and sometimes the freshest, most interesting ideas come from people who are distanced from the current process. I personally wish this list were moderated, so those fresh ideas and "better ways" could be presented and discussed without the personal attacks and without getting a letter grade from the self-appointed evaluator who trolls this list and permeates it with negative posts. Also, from that same quote above, I consider it an insult to the many hard-working post-processors to state that "it's easy" under any circumstances to do that work. --Roger Frank From hart at pglaf.org Thu Jul 31 09:48:48 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 31 Jul 2008 09:48:48 -0700 (PDT) Subject: [gutvol-d] woman in her own right -- 008 (and final) In-Reply-To: References: Message-ID: Since there is not substantive content to this purported "civil response" it still falls into the FLAME category. Please refrain from sending messages that only contain a personal attack, no matter how "civil" you think you may have stated your attack. This puts you on the list for being banned if and when a situation arises if/when this list supports censorship. Not that I have any particular interest in defending all the posts by bowerbird, but I feel I should point out to the concerned parties that he has been set up to quite a significant degree by "tag team flamers" who alternate a series of message carefully gauged to antagonize him but are couched in these "civil" terms and, when confronted, the "tag team flamers" each say they only send a minimal number of messages each while bowerbird sent many. . .an equal number perhaps to those from the tag team flamers. I have also seen this on a number of other listservers-- and I hope it makes it into some sort of list manual. Michael Hart Founder Project Gutenberg On Wed, 30 Jul 2008, Andrew Sly wrote: > > I usually just skim past or delete BB posts. > But I must say this one actually made me angry. > > So, I'll try to contain that, and post a civil response. > > Just a quick reminder for any newcomers around, > that our friend BB has shown over the last few years > a habit of telling others what they should do, > but has still not (that I am aware of) contributed > anything of measurable substance towards PG. > > My experience is that once in a while he does give you > an idea to make you think, but overall his inflammatory > comments and apparent inability to work with others > at all have resulted in his being banned from three > different message areas that I know of. > > His previous ban on this mailing list was only temporary > for reasons of wanting to remain fair and open, > which Greg Newby described quite well at the time. > > Andrew > > On Wed, 30 Jul 2008 Bowerbird at aol.com wrote: > >> the proofers have finished "in her own right" in p1 and p2. > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From hart at pglaf.org Thu Jul 31 10:50:14 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 31 Jul 2008 10:50:14 -0700 (PDT) Subject: [gutvol-d] !@! Re: woman in her own right -- 008 (and final) In-Reply-To: References: <4891EF2C.7020309@xs4all.nl> Message-ID: I have received several notes, mostly private, that seem to have ignored my statement that I am not interested in types of defenses for bowerbird, and I should add that he will be the first, and has been in the past, to say no need for it. However, flames are flames, and should be pointed out by an internet listserv moderator, which I have done. If you are all so adamant about toasting bowerbird I now do a repeat of a theme we have discussed often. You are ALL welcome to start your own listsevers at expense to be 100% defrayed by Project Gutenberg. So, once again, I simply point out that if you don't want a contact with bowerbird. . .which you all SAY. . .all you do is start your own listserver and don't let him in, or use a heavy hand on "moderation" if you do let him in. These solutions are simple. Always have been available. The fact that you don't use them belies the claim that your real interest is not hearing from bowerbird. I agree with with many of the comments I have received that this ongoing flame war is NOT a good thing. If you just ignored bowerbird, it would not be there. Once again I have been pilloried for NOT killing him off in a situation you could have eliminated several easy ways. If you think you can force me into using moderation weapons then I suggest you think again. . . . If I ever use such weapons, those who wanted it will be the first to go. . . . This is an open list, and will remain so as long as you are on it. . .so start your own list if you want otherwise. We will gladly pay all the expenses, provide the hardware & software necessary. The way things are you are just mountaining a molehill. Please stop. Hoping to thank you for consideration in the near future, Michael Hart Founder Project Gutenberg From Bowerbird at aol.com Thu Jul 31 11:22:52 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 31 Jul 2008 14:22:52 EDT Subject: [gutvol-d] woman in her own right -- 008 (and final) Message-ID: roger said: > "i'm not gonna waste and more of my time...." means to me that > BB felt the analysis he had already done was a waste of time. some of the analyses i've done have been worthwhile. but i've already adequately demonstrated that shoddy preprocessing wastes the time of the proofers, so there's no use proving that again. do you really think you can continue running crappy "experiments" and that i'm gonna continue spending my time "analyzing" all of 'em? > Well, the page headers were left in for this book (and noted in > the project thread) to see if forcing attention to the top of the page > would make the proofers more accurate regarding top-of-page > paragraph dileneation. this is cute. "we're gonna leave in some errors on purpose, to see if it helps the proofers be more accurate when it comes to other errors." however, as repugnant as that sounds, if it was supported by _data_, it might be interesting. but notice that roger has given us no data... and even if he had, it wouldn't be all that meaningful, because the top-of-page new-paragraph blank-lines are easy enough to insert _automatically_, in preprocessing. which makes the phrase go like: "we're gonna leave in some errors that we can find automatically to see if it helps the proofers find other errors we can find automatically." it'd be better to just find all the errors automatically, and _fix_them_... > So out of the 1200 lines, about half of them were adjustments to > the top of the page markup on most of the 337 pages. um, no. at most that would be 337, which is hardly "about half". i don't even count the addition or deletion of blank lines, at the top of the page or anywhere on a page, as that's a "bureaucratic" change, since the paragraphing can (and should) be done in preprocessing... > Of those that remain, it averages to > just about two lines with corrections per page. > I'm comfortable with that. well, i'm glad roger is "comfortable with that". but since the preprocessing that i did on this book -- using nothing but obvious cleaning routines -- left just 3 errors in the _entire_book_, my own feeling is that i wouldn't be comfortable unless i did that well. two corrections per page means "this book needs another round". 3 corrections for the entire book means "this one can go out now". that's a _huge_ difference. > An interesting, if specious, justification for a personal attack. no. you _clearly_ demonstrated last week that you _did_indeed_ take what i said personally, even though it wasn't written that way. i just gave you a sample of what it looks like when i do get personal. and, by the way, saying "you can do a lot better" is hardly an insult... you, on the other hand, seem to feel quite comfortable calling me "a troll" and making all kinds of other uncomplimentary accusations. it seems that your brand of "politeness" has an escape-clause in it. (which is true of most "win friends/influence people" proponents; evidently if someone is not susceptible to your charms, it gives you a full and complete license to abuse them in any possible manner, which just goes to show how thin the veneer is on that philosophy.) > But the quoted passage misses the point entirely > (as usual, as it's meant to do, for effect.) no, the "quoted passage" said that you are giving proofers something that is _shoddy_ -- because it was subjected to bad preprocessing -- when you could be giving them something that is clearly much better, as i am showing in the separate series on "how to do preprocessing". i took the error-rate in the book down to _3_errors_. _three_, roger. your so-called preprocessing left 1200 errors in the book, or more... if you don't see the difference, _you_ have "missed the point entirely". just do the job right. that's all i'm asking. is it really all that hard? > I think back to when I became a high-school teacher. well, thanks for the folksy anecdote... now, will you please go back and improve your preprocessing tool? because this has _nothing_ to do with you -- at all, in the slightest -- and _everything_ to do with how the workflow at d.p. is constituted... thank you. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080731/fd1b6f06/attachment.htm From Bowerbird at aol.com Thu Jul 31 11:33:35 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 31 Jul 2008 14:33:35 EDT Subject: [gutvol-d] woman in her own right -- 008 (and final) Message-ID: andrew said: > But I must say this one actually made me angry. andrew, if you can't refute the evidence i've offered -- or at least address it in a substantive manner -- then it will be better if you don't say anything at all, because this just shows how flimsy your reaction is. > but has still not (that I am aware of) contributed > anything of measurable substance towards PG. maybe you don't see the value of the many analyses that i have done on the various d.p. experiments, or my constructive criticism and frequent suggestions... but i can assure you the future _will_ see the value -- what with hindsight being 20/20 and all that -- and then it will mock you for your shortsightedness. wouldn't be the first time the outsider was right and the insiders -- all of 'em -- were wrong wrong wrong. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080731/4cb577ed/attachment.htm From richfield at telkomsa.net Thu Jul 31 11:11:28 2008 From: richfield at telkomsa.net (Jon Richfield) Date: Thu, 31 Jul 2008 20:11:28 +0200 Subject: [gutvol-d] !@!ACTA trade agreement brief for July 29-31 Washington Message-ID: <48920050.6070004@telkomsa.net> Sorry Michael, this feedback is probably late and certainly plaintive rather than constructive. But you DID ask! so, fwiw: My son, who has strong views on the subject said, in reply to my passing on your message: ====================== This sort of thing is going on with nauseating regularity, alas. The trouble behind it is that: a) a lot of politicians have accepted without review the claims of the intellectual property brigade b) a lot of companies have discovered that while outright bribery is illegal, politically sensitive gifts work wonders from a policy standpoint. This is a classic protectionist move, really. All the old brands of the first world want protection from knockoffs from elsewhere, and they're trying to get the government to do their work for them. ====================== Not that I think that he should be so uncharitably cynical of course, but a bad upbringing will out. Anyway, the sum is that some people think that in this way they might make more money by depriving the public of things that they had been entitled to; who are we to disillusion them by frustration? Obscurely I am reminded of something Bierce said: "...I must take the liberty to remind him that the law of supply and demand is not imperative; it is not a statute but a phenomenon. He may reply: "It is imperative; the penalty for disobedience is failure. If I pay more in salaries and wages than I need to, my competitor will not; and with that advantage he will drive me from the field." If his margin of profit is so small that he must eke it out by coining the sweat of his workwomen into nickels I've nothing to say to him. Let him adopt in peace the motto, "I cheat to eat." I do not know why he should eat, but Nature, who has provided sustenance for the worming sparrow, the sparrowing owl and the owling eagle, approves the needy man of prey and makes a place for him at table." It might strike anyone that I draw a strained comparison between the exploiter of women in employment and exploiters of "intellectual property" that they largely had no hand in creating, and far more largely doom to oblivion simply by keeping them out of circulation, rather than the lesser crime of profiting from printing what is neither logically nor honestly theirs. True no doubt, but the motto "I cheat to eat." springs nimbly to mind. That they should compound parasitism with dog-in-the-mangerism is distasteful rather than astonishing. People who have noted some of the titles that I have provided either for Gutenberg US or AU, might be slightly puzzled at my choices, but they embody a strong trend towards worthwhile books that are little known and out of print. Such legislation tends to the total loss of such books. That loss is a loss to society. It adds nothing to the material gain for parasites who couldn't give a damn either way if it means no money for them one way or the other, so I don't waste my breath on them. Some books are of direct value because of their content, and their loss may be loss as such. Others are losses because they have value in their relevance to the study of ideas in their times and communities. This too is a loss, so pleading that only worthless materials will fail to get published is unworthy. What we have here is the veto of public good in the interests of greed and sloth. Maybe what we need is some sort of register of titles to which parties might submit lists of material that they desire to publish on a not-for-profit basis. Then if someone objects because they have both the right and intention to publish it commercially instead, their right prevails. Otherwise we innocents could publish those titles for the benefit of readers who are not in a position to inflate the coffers of that good Mr. Munniglut that Bierce referred to in the work I quoted: "...contentedly smoothing the folds out of the superior slope of his paunch, exuding the peculiar aroma of his oleaginous personality and larding the new roadway with the overflow of a righteousness stimulated to action by relish of his own identity. And ever thereafter the subtle suggestion of a fat philistinism lingers along that path of progress like an assertion of a possessory right." Sorry about that, but I always had a weakness for Bierce's finer efforts. Some of them are already very hard to obtain through normal channels. How many other books are vanishing beyond the mandibles of silverfish as we discuss all this? And some of the material isn't even as melodramatically rhetorical as my diatribe. Cheers, Jon From Bowerbird at aol.com Thu Jul 31 15:57:09 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 31 Jul 2008 18:57:09 EDT Subject: [gutvol-d] !@!ACTA trade agreement brief for July 29-31 Washington Message-ID: jon richfield said: > Not that I think that he should be so uncharitably cynical of course, > but a bad upbringing will out. you funny... :+) -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080731/d7364838/attachment.htm From hart at pglaf.org Thu Jul 31 16:47:02 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 31 Jul 2008 16:47:02 -0700 (PDT) Subject: [gutvol-d] QWQRe: !@!ACTA trade agreement brief for July 29-31 Washington In-Reply-To: <48920050.6070004@telkomsa.net> References: <48920050.6070004@telkomsa.net> Message-ID: yOn Thu, 31 Jul 2008, Jon Richfield wrote: > Sorry Michael, this feedback is probably late and certainly plaintive > rather than constructive. But you DID ask! so, fwiw: I reply with the below still intact because so much of it is worthy of looking at, and may have been overlooked before. "Intellectual Property Brigade". . .about as good as WIPO, or "World Intellectual Property Organization." You might pass on to your son that WIPO is descended from "The Stationers Guild" and then "The Stationers Company," who drafted the original of copyright law as we know it-- directly in response to The Gutenberg Press which ruined, from The Stationers POV, the monopoly they had had, since time immemorial. . . . I have more more to pass on to him, if he wants, and I am hoping he will take a look at my blog as well. BTW, if a file is found there that did NOT pass the spellcheck I am hoping you or he will point it out. I lost it, and would replace it, if I could only remember which one it was. /// As far as Ambrose Bierce goes, I am as great a fan as any I know, but I cannot agree that Mr. Munniglut has a right to mistreat anyone, cheat anyone, defraud anyone, etc. in his effort to pay for the upkeep of his family, put those kids through college, etc., etc., etc. The Munnigluts make it sound so peaceful and proper, when they say they "owe it to their shareholders" to screw the world at large. No sire, I do not agree, no matter how royal you may be. It is obvious to anyone who looks that the result of each of the various copyrights and copyright extensions has to be considered MORE as the destruction of a public domain, and LESS the actual increased profit to the booksellers. After all, the ONLY things still selling all that well in the extension periods are the best of the best sellers; a law the removes the public domain from everyone, just for a few percent more profits to those who have already made the greatest profit seems all too much Reverse Robin Hood And His Merry Men. . .if you take my meaning. This is what happens when you let business be government. It seem to be just the opposite of The Magna Carta, which was Project Gutenberg's eBook #10,000 for a good reason. Well, enough for now, but I hope you will encourage a son and/or other family members to further the conversation. Michael > My son, who has strong views on the subject said, in reply to my passing > on your message: > ====================== > > This sort of thing is going on with nauseating regularity, alas. > > The trouble behind it is that: > a) a lot of politicians have accepted without review the claims of the > intellectual property brigade > b) a lot of companies have discovered that while outright bribery is > illegal, politically sensitive gifts work wonders from a policy > standpoint. > > This is a classic protectionist move, really. All the old brands of > the first world want protection from knockoffs from elsewhere, and > they're trying to get the government to do their work for them. > > > ====================== > > Not that I think that he should be so uncharitably cynical of course, > but a bad upbringing will out. > Anyway, the sum is that some people think that in this way they might > make more money by depriving the public of things that they had been > entitled to; who are we to disillusion them by frustration? Obscurely > I am reminded of something Bierce said: > "...I must take the liberty to remind him that the law > of supply and demand is not imperative; it is not a statute but a > phenomenon. He may reply: "It is imperative; the penalty for > disobedience is failure. If I pay more in salaries and wages than I need > to, my competitor will not; and with that advantage he will drive me > from the field." If his margin of profit is so small that he must eke it > out by coining the sweat of his workwomen into nickels I've nothing to > say to him. Let him adopt in peace the motto, "I cheat to eat." I do not > know why he should eat, but Nature, who has provided sustenance for the > worming sparrow, the sparrowing owl and the owling eagle, approves the > needy man of prey and makes a place for him at table." > > It might strike anyone that I draw a strained comparison between the > exploiter of women in employment and exploiters of "intellectual > property" that they largely had no hand in creating, and far more > largely doom to oblivion simply by keeping them out of circulation, > rather than the lesser crime of profiting from printing what is neither > logically nor honestly theirs. > True no doubt, but the motto "I cheat to eat." springs nimbly to mind. > That they should compound parasitism with dog-in-the-mangerism is > distasteful rather than astonishing. People who have noted some of the > titles that I have provided either for Gutenberg US or AU, might be > slightly puzzled at my choices, but they embody a strong trend towards > worthwhile books that are little known and out of print. > Such legislation tends to the total loss of such books. That loss is a > loss to society. It adds nothing to the material gain for parasites who > couldn't give a damn either way if it means no money for them one way or > the other, so I don't waste my breath on them. Some books are of direct > value because of their content, and their loss may be loss as such. > Others are losses because they have value in their relevance to the > study of ideas in their times and communities. This too is a loss, so > pleading that only worthless materials will fail to get published is > unworthy. What we have here is the veto of public good in the interests > of greed and sloth. > > Maybe what we need is some sort of register of titles to which parties > might submit lists of material that they desire to publish on a > not-for-profit basis. Then if someone objects because they have both > the right and intention to publish it commercially instead, their right > prevails. Otherwise we innocents could publish those titles for the > benefit of readers who are not in a position to inflate the coffers of > that good Mr. Munniglut that Bierce referred to in the work I quoted: > "...contentedly smoothing the folds out of the superior slope of his > paunch, exuding the peculiar > aroma of his oleaginous personality and larding the new roadway with > the overflow of a righteousness stimulated to action by relish of his > own identity. And ever thereafter the subtle suggestion of a fat > philistinism lingers along that path of progress like an assertion of a > possessory right." > > Sorry about that, but I always had a weakness for Bierce's finer > efforts. Some of them are already very hard to obtain through normal > channels. How many other books are vanishing beyond the mandibles of > silverfish as we discuss all this? > > And some of the material isn't even as melodramatically rhetorical as my > diatribe. > > Cheers, > > Jon > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From lee at novomail.net Thu Jul 31 18:32:28 2008 From: lee at novomail.net (Lee Passey) Date: Thu, 31 Jul 2008 19:32:28 -0600 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <488FB357.1080307@novomail.net> References: <488FB357.1080307@novomail.net> Message-ID: <489267AC.3060105@novomail.net> Lee Passey wrote: > John Vandenberg wrote: [snip] >> Any existing code around to do something like this ? > > Yes. > > I have created some code to do this, which I would be happy to share > with you, but I'm hoping someone else has done it better. I'm currently > checking out HTML Match (http://www.htmlmatch.com/) which claims that it > is able to "ignore the source code and compare only the text content of > the web pages." If you're interested, I'll report back on what I find. OK, for what I am trying to accomplish, I must report that HTML Match sucks. So far, I still haven't found any FOSS program, other than GNU diff, which I can leverage. So Carlo, have you successfully used wdiff to merge presumably clean text into an XML file? To be honest, I'm thinking that dwdiff, with its ability to set characters which should be word delimiters may be the answer. From traverso at posso.dm.unipi.it Thu Jul 31 22:52:49 2008 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Fri, 1 Aug 2008 07:52:49 +0200 (CEST) Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <489267AC.3060105@novomail.net> (message from Lee Passey on Thu, 31 Jul 2008 19:32:28 -0600) References: <488FB357.1080307@novomail.net> <489267AC.3060105@novomail.net> Message-ID: <20080801055249.75FCB93B61@posso.dm.unipi.it> >>>>> "Lee" == Lee Passey writes: Lee> So Carlo, have you successfully used wdiff to merge Lee> presumably clean text into an XML file? To be honest, I'm Lee> thinking that dwdiff, with its ability to set characters Lee> which should be word delimiters may be the answer. I haven't merged text into XML recently, but in these cases I filter out the markup and compare the resulting text. I know how to modify a tool that I wrote a few years ago to find and merge differences at the character level to allow merging back text corrections into marked-up text, but I have never finished it. Carlo From bzg at altern.org Thu Jul 31 20:10:59 2008 From: bzg at altern.org (Bastien Guerry) Date: Fri, 01 Aug 2008 05:10:59 +0200 Subject: [gutvol-d] !@! Re: woman in her own right -- 008 (and final) In-Reply-To: (Michael Hart's message of "Thu, 31 Jul 2008 10:50:14 -0700 (PDT)") References: <4891EF2C.7020309@xs4all.nl> Message-ID: Michael Hart writes: > So, once again, I simply point out that if you don't want a > contact with bowerbird. . .which you all SAY. . .all you do > is start your own listserver and don't let him in, or use a > heavy hand on "moderation" if you do let him in. > > These solutions are simple. I'm not in favor of moderation. But it's not that easy to build another list. If I build another list, I want people to know about this, and I will surely send an email here, because I believe the gutvol-d list attracted many interesting people. How then can I be sure that the one doing that much noise on this list will not join the new list under another name? Ignoring noise is always possible, but it requires a lot of energy. I think people would prefer to spent this energy on discussing things in a more constructive way. Anyway. -- Bastien