From richfield at telkomsa.net Sun Jul 1 03:29:22 2007 From: richfield at telkomsa.net (Jon Richfield) Date: Sun, 01 Jul 2007 12:29:22 +0200 Subject: [gutvol-d] Sorry folks, but I seem to have missed the obvious. Message-ID: <46878202.4080201@telkomsa.net> Re: Sorry folks, but I seem to have missed the obvious. Thanks to all who replied. If I don't sound overwhelmed, that is purely because I am by now inured to your standards of helpfulness, and therefore am not surprised at your patience. In order of convenience: First, BB. > you should e-mail a whitewasher. i'll backchannel you an e-address.< Thanks, but I'll take a rain check on that one. My problem is not so much that I want to stir up the antnest, but that I was not sure that I had got my prepared material through to the ants at all, and if not, why not. Next Josh. >1st - Did you get a copyright clearance on the books before you started?< Well, sorta-kinda. As I understand it, I do not need to get clearance for items that have already been cleared. (e.g. Bindle, fragments of science, and Just William). Secondly, it seems nutty to get clearance on e.g. Practical taxidermy, which seems to date from the 1880s, though its TP&V show no date (I had to deduce it from the text. Big novelty, hm?) Thirdly, I included the TP&Vs for the others in the ZIP files in which I sent them. Also, when I despatched them, using the web page allocated for the purpose (can't remember the details, but it was all very proper, with my nice new password etc) it only complained about one of the books I tried to send, and the way it did that was by insisting on getting the clearance code before letting me send it. (I think that was the entomological dictionary, or possibly Practical taxidermy.) So I decided (which I have just discovered by finger trouble, to be nearly the same spelling as decoded, which would have confused the issue) to call it a night and wait to see what happened to the others. So far nothing. Hence my screaming for the better business bureau. >2nd - Word97 isn't a format we support. If it is a simple text, it is fairly simple to convert it to a standard text file, but you may want to do that in the future yourself so that you can make sure it "looks right" in its final form. Especially if you're going to be doing a lot of books (which it sounds like you are), you'll want to do that (as well as use tools like GutCheck) so that your text is as close to "finished" as possible when you upload it.< Yesss... well, it isn't so simple. (It never is, is it? (Had to slip that one in before you said it!)) Bindle and William and a lot of the vanilla fiction and philosophy do very well in TXT form (except that the TOC doesn't add much value, but that does not matter much in machine readable form, given that most reading software permits a search function of some form.) Unfortunately, much mathematical and other scientific material is simply incomprehensible in TXT format. Pictures are a trifle itchy too. I don't mind omitting, say, the illustrations to the William books or "Child of the Deep", though a purist might object, and other purists (including myself, nearly) can hardly imagine Carroll without Tenniel, but books on science, such as Lubbock's "Senses of Insects" (also 19th century, amazingly) are almost useless without their illustrations, but invaluable with them. And, please note, some of these are truly great, nowadays badly neglected, books.
Now, all that is obvious to most of us, but less obvious is a book such as Fowler's "The King's English". My copy is in good condition (actually, it is my wife's, which partly explains that) and the scanner loved it, so I thought that preparing the final text from the scanned material would be a doddle. That despairing moan that you were wondering about a few weeks ago came from this end of the planet. It was the hardest book I have worked on yet, and that is saying something (though, heaven help me, I am casting wistful eyes at some that bid fair to be worse!) Firstly, it is the first book where I really do need the TOC and the index, which mean that the pagination matters. In TXT files that is a nuisance at best, though it is not a show-stopper. However, the Fowlers' text formatting is fairly parsimonious, but highly significant in semantic terms, which means that any re-formatting would be prohibitive. When those blighters used italics, they meant it! The text would be nearly useless and completely maddening if the italics were not visible. Checking on that took me weeks, for something hardly larger than a booklet! This is one book that I did not even *bother* to convert to TXT format, even though it contained neither formulae nor illustrations, just a very little Greek, which I entered manually, and could have managed in a TXT file. >3rd - Where did you send the file? There are specific steps and locations to go to upload a new etext, but it's possible you got turned around and sent it somewhere that rarely (or even never) gets checked by anyone that could help things along.< Well, I cannot remember the details, but I got myself an ID and a password through the PG channels, and submitted the files that it would accept, as I described above. Years ago I simply emailed stuff directly to MH, but I see that things have changed since then. >Normally, a "finished" etext usually gets posted within a couple days of uploading it, so it sounds like there is something else going on here. Finding out what exactly you've done so far ought to help us track down where the pipe got clogged!< Right, hence my coming forward with cries of peccavi! :-S >PS If the final cleaning steps to get it ready are more work than you want to do, you may want to see about just scanning the books, then running them through Distributed Proofreaders (www.pgdp.net). They've go lots of folks willing to help out at all stages.< Thanks to you and them, and no doubt I shall make some use of them in future, but for some of the books I have been working on, I think it is unnecessary, while for others I prefer to do the whole thing. Simply scanning the visual material (say to the stage of the .OPD files) is much, much easier, but I get the impression that I would simply be adding more to a mountain of undone work. Conversely, since I have the source material I do not have to go to the lengths of precision of scanning that Jon Noring proposes, so there may be some sense to my taking work to at least near-completion. I am not yet certain how far to take all this, but none of it is as important as getting to a point of successful submission, and knowing when I have succeeded. Sankar wrote: >I do not see any of your books being uploaded or posted.< Thanks. That I needed to know. >You may remember that I had advised you the steps for uploading a book. Later on Joe and myself had advised you about the clearance line. < Yes, but I hope what I wrote above makes it clear (or a bit clearer anyway) where I have erred. 
>Service to Humanity is Service to God< That works for me. Or at any rate, it will when I can get it to work! :-) OK folks, thanks for your trouble so far. Suggestions and corrections welcome. For one thing, given that I have this problem with texts that are not adequately served by TXT files, and that there is an understandable distaste for Word, what can I do? I seem to remember that there is some dissatisfaction with the format in which Word produces HTM documents, though *for the most part* it looks reasonable to me. In short, can anyone prescribe the best way to get from Word to civilisation without investing in a lot of extra software? Cheers Jon From prosfilaes at gmail.com Sun Jul 1 06:46:00 2007 From: prosfilaes at gmail.com (David Starner) Date: Sun, 1 Jul 2007 08:46:00 -0500 Subject: [gutvol-d] Sorry folks, but I seem to have missed the obvious. In-Reply-To: <46878202.4080201@telkomsa.net> References: <46878202.4080201@telkomsa.net> Message-ID: <6d99d1fd0707010646ve5a1831kcd73b9745d066e59@mail.gmail.com> On 7/1/07, Jon Richfield wrote: > Well, sorta-kinda. As I understand it, I do not need to get clearance > for items that have already been cleared. (e.g. Bindle, fragments of > science, and Just William). Secondly, it seems nutty to get clearance > on e.g. Practical taxidermy, which seems to date from the 1880s, though > its TP&V show no date (I had to deduce it from the text. Big novelty, > hm?) Thirdly, I included the TP&Vs for the others in the ZIP files in > which I sent them. No, you need to get clearance on everything. Clearance does two things: first, it's the way we coördinate who's working on what. Secondly, it's the way that PG verifies and can prove that the _editions_ (not just the books) are in the public domain; if someone claims that the Practical Taxidermy that you worked from was printed in 1925 and they own the copyright, we need more than just "seems to date from". From joshua at hutchinson.net Sun Jul 1 07:00:42 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Sun, 1 Jul 2007 14:00:42 +0000 (UTC) Subject: [gutvol-d] Sorry folks, but I seem to have missed the obvious. Message-ID: <15708130.1183298442924.JavaMail.?@fh1037.dia.cp.net> As David pointed out in another reply, the clearances are absolutely essential. We have to have them to cover our butts, if for no other reason. And much of the stuff you talk about CAN be done in a txt file using some standard conventions. i.e., italics are set off with _underscores_ around the word. Chemical and mathematical formulae can be done with underscores and carets (H_{2}O or a^2 + b^2 = c^2). Illustrations can have placeholders in the text like this: [Illustration: This is the caption below one illustration] And if you do an HTML version (highly recommended) then you can put the images directly inline in the text. Avoid Word at every stage. It'll just cause you grief in the long run, because it doesn't really do ANYTHING the way it needs to be done for a PG text. Generally, you'll need to read up in the FAQ for formatting information, which is where I'm guessing you have the most work to do. Good luck, Josh
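To make the conventions Josh describes concrete, a made-up fragment (not taken from any actual PG text, and not an official template) might look like this in a plain-text file:

```text
This sentence shows a word in _italics_, a chemical formula such
as H_{2}O, and a mathematical expression such as a^2 + b^2 = c^2.

[Illustration: Caption of the figure that appeared here in the
printed book]
```

The plain-text version stays readable as it is, and an HTML version can later turn the underscores back into real italics and drop the image in where the placeholder sits.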
From Bowerbird at aol.com Sun Jul 1 14:06:29 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 1 Jul 2007 17:06:29 EDT Subject: [gutvol-d] Sorry folks, but I seem to have missed the obvious. Message-ID: jon richfield said: > Thanks, but I'll take a rain check on that one. > My problem is not so much that I want to stir up the antnest, > but that I was not sure that I had got my prepared material > through to the ants at all, and if not, why not. i wasn't suggesting you should "stir up the antnest", merely telling you that they can answer your questions about whether your material got through, and we can't. :+) -bowerbird From brynnahlld at yahoo.com Sun Jul 1 15:06:45 2007 From: brynnahlld at yahoo.com (Elisa) Date: Sun, 1 Jul 2007 15:06:45 -0700 (PDT) Subject: [gutvol-d] Sorry folks, but I seem to have missed the obvious. In-Reply-To: Message-ID: <795752.17285.qm@web52603.mail.re2.yahoo.com> >As I understand it, I do not need to get clearance >for items that have already been cleared. (e.g. Bindle, fragments of >science, and Just William). Quite the contrary. When something's been cleared, that means that someone else has expressed the intention of working on it, so you need to make sure that effort isn't being duplicated. (Just William, for example, is in post-processing at Distributed Proofreaders, so most of the work on it has already been done.) However, quite often people will get a clearance, and then abandon the project, so it won't hurt to ask for clearance, just to make sure the other person still has an interest in the project. >This is one book that I did not even *bother* to >convert to TXT format My understanding was that all PG books *must* have a TXT format if that's at all possible. There are standard conventions for _italics_ and =bold=. I'd recommend hiding Word97 and checking out a basic freeware text editor, using none of the 'extras' like italics and font changes. EditPadLite works well for me for programming files, so I know it isn't adding unseen cruft to the files. From gbnewby at pglaf.org Sun Jul 1 20:57:22 2007 From: gbnewby at pglaf.org (Greg Newby) Date: Sun, 1 Jul 2007 20:57:22 -0700 Subject: [gutvol-d] Fwd: Educated Earth Website / Donation to PG (fwd) Message-ID: <20070702035722.GA16513@mail.pglaf.org> Has anyone seen the www.educatedearth.net site in action?
(We sent Ben info about sending us money) >[ben at educatedearth.net - Sun Jul 01 13:03:31 2007]: > >Hio. My name is Ben Lovatt, I'm the owner of a humanitarian >science/technology website called EducatedEarth ( >http://www.EducatedEarth.net ). We raise money in donations (in addition >to 10% of our profits) and give them to a different organization every >month. To decide which organization should receive the money, we have our >members give us suggestions on companies and we let viewers of our site >vote on where to donate it. Project Gutenberg has been nominated and is in >this month's poll. You're welcome to encourage your staff and website >visitors to vote for you. > >If your organization was to win, how would I send this money to you? > >Thanks, >Ben Lovatt From Bowerbird at aol.com Mon Jul 2 10:53:28 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 2 Jul 2007 13:53:28 EDT Subject: [gutvol-d] z.m.l. examples reloaded Message-ID: i have had some z.m.l. example books up in various locations, at various times, but i've reloaded them at my z-m-l.com site... these samples are solid proof-of-concept of z.m.l. usefulness. (i relate _why_ i've adopted each of these books as an example; do note that none of the examples were cherry-picked by me.) they demonstrate z.m.l. at the stage of "continuous proofing", where the text for every page is displayed alongside its scan, so the general public can check and report possible errors... here's "books and culture", from hamilton wright mabie: > http://www.z-m-l.com/go/mabie/mabiep123.html this was google's first revealed public-domain book. here's "the secret garden", by frances hodgson burnett: > http://www.z-m-l.com/go/sgfhb/sgfhbp123.html this was a book from the p.g. library that was _redone_. here's "my antonia", written by willa cather: > http://www.z-m-l.com/go/myant/myantp123.html this was a book that jon noring used as his example. here's "a hacker manifesto", from mckenzie wark: > http://www.z-m-l.com/go/ahmmw/ahmmwp012.html this was a book that the if:book people recommended. here's "the open library", by brewster kahle: > http://www.z-m-l.com/go/tolbk/tolbkp012.html this book details the philosophy of the open content alliance. *** and, for another manifestion of z.m.l. in action, see this page: > http://www.z-m-l.com/go/vl3.pl this demo shows how the "no-markup" z.m.l. "master" files can be converted on-the-fly in real-time to .html versions. you can examine the z.m.l. file by clicking each link, and then generate its .html version by clicking each button... books included in this demo are: > a test-suite for project gutenberg e-texts > a document listing the 11 rules of zen markup language > a presentation given by cory doctorow at microsoft > "a christmas carol", by charles dickens > "lady clare", by alfred tennyson > "the lady of shalott", by alfred tennyson > "the mysteries of the caverns", by roger finlay > "fort amity", by arthur thomas quiller-couch > "the master-knot of human fate", by ellis meredith > "the tragedy of pudd'nhead wilson", by mark twain *** more samples will follow soon... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070702/393c6552/attachment.htm From Bowerbird at aol.com Mon Jul 2 11:14:57 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 2 Jul 2007 14:14:57 EDT Subject: [gutvol-d] i decided to think about the future, not the past Message-ID: there was too much chatter for me to proceed on friday, especially with the iphone changing the world that day by putting the internet in your pocket, so i decided think about the _future_, not the past... but i will continue my series on d.p. efficiency with a post on item #3, preprocessing, later today. meanwhile, i had already written up this advance preview on item #4, coming next tuesday, for you to ponder. here's a revealing factoid... for literally _years_, until just _this_ spring, distributed proofreaders did not offer to its proofers -- who were correcting the o.c.r. text -- the basic spellcheck functionality of adding a word to the dictionary. you know, when you're doing your spellcheck, it tells you "word not found" and gives you one or more suggestions, as well as some other options, typified by this screenshot: > http://www.z-m-l.com/go/scinterface.png (this is the word-not-found dialog from the mac's textedit.) in this screenshot, it's telling me it can't find "bowerbird" in its english dictionary, and it makes one suggestion (bower-bird), and allows me to choose whether to (a) ignore it, or (b) guess (presumably generate a suggestion, i would guess, just a guess), or (c) find next, or (d) correct it as it was edited in the textbox... _plus_ at bottom right, two more buttons -- "forget" and "learn". the ability to "forget" a word easily via a mere button-click is nice -- you don't get that option up-front in very many spellcheckers -- but let's focus on what our focus has been on: adding a new word. the d.p. spellchecker just didn't have that "add" option on it. really! it couldn't "learn" a new word. not one. let alone forget it after that, it couldn't even learn it in the first place. pretty sad, don't you think? well, turns out this failure to be able to learn had some consequences. indeed, some rather serious and ugly consequences. for instance, it meant that the _names_ of a book's characters -- which will come up very frequently over the course of the story -- could not be added to the dictionary to remove their flags globally. so, for the example of "my antonia", the main character's name of "antonia" was flagged each and every time it appeared in the book. and not just that name, but a bunch of _other_ names in the book. ouch. that's a whole lotta flags you'd have to ignore. flag overload. makes it hard to pick out the real problems amidst all of the fakes... and hey, maybe you don't think that's too bad, because it's easy to see at a quick glance whether "antonia" was recognized correctly... but what about the names i've appended?, all from e-text #13603, "the hawaiian romance of laieikawai", chock full of hawaiian names... those are all the names you have to check, some with _very_many_ occurrences; after a while your head be swimming like it's in maui, with paris hilton recuperating from her recent trip to the county jail. how'd you like to proof those papayas? and once you had checked one of these nightmare names, you'd want to add it to the dictionary so -- at least if it came up _exactly_ the same -- it would _not_ be flagged again, so you'd know you didn't have to do the plow-through. i know _i_ would sure be appreciative... 
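a minimal sketch of what that "learn" button amounts to -- a per-book word list consulted alongside the base dictionary, so a name like "antonia", once added, stops being flagged for the rest of the project. (the file names and word lists here are invented for illustration.)

```python
import re

def load_words(path):
    # one word per line; a set gives fast lookup
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def flag_unknown(page_text, base_dict, project_dict):
    # anything not in either list gets flagged for the proofer
    words = re.findall(r"[A-Za-z']+", page_text)
    return [w for w in words
            if w.lower() not in base_dict and w not in project_dict]

def learn(word, project_dict, path="project_dict.txt"):
    # the "add to dictionary" button: remember the word for this book
    project_dict.add(word)
    with open(path, "a", encoding="utf-8") as f:
        f.write(word + "\n")

# hypothetical usage
base = load_words("english_words.txt")   # the stock dictionary
project = set()                          # the book's own additions
learn("Antonia", project)
print(flag_unknown("Antonia went to town with Lena", base, project))
```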
and i bet there were probably also all kinds of other hawaiian words that showed up with quite high frequencies in that book, and thus caused proofers unnecessary pain because they couldn't be added to the book's dictionary so that they wouldn't continue to be flagged. yet for _years_ d.p. proofers went without this _core_ functionality... it was a _bad_ situation... and it was allowed to drag on for _years_... to me, that simple fact communicates a _world_ of disrespect for the proofers who're volunteering their time and energy to help the cause. a spellchecker is probably _the_ most valuable tool there is to discover scannos made in the o.c.r. process, yet d.p. provided its proofers with a substandard version of the tool, one that was clearly inferior. shame. they kept saying it was "a shortage of programmers", but meanwhile they seemed to have enough "programming help" to make all kinds of modifications to their system, thoroughly up-ending the whole thing, going from two rounds to four, and then to five, separating proofing from formatting, etc. but yet, they couldn't fix a major broken tool... there's something wrong in the priorities there. and _badly_ wrong... oh, they've fixed this terrible shortcoming, finally, to get a little credit, but they put in its place an unwieldy version of the tool, such that only the project manager can "add" a word to the dictionary for each book; the proofers themselves -- who have to bear the brunt of false flags -- evidently cannot be "trusted" with such a decision. it's really very sad... i'll have more to say on this general topic. but keep these aloha names in your aloha mind as an aloha vivid exemplar of my overall aloha point... you say goodbye, and i say hello... -bowerbird > Achatinella Ahewahewa Aholenuimakiukai Aikanaka Aiohikupua Aiwohikupua Akahiakuleana Akanikolea Akikeehiale Alelekinakina Alihikaua Aukelanuiaiku Aukelenuiaiku Aukuuikalani Hakalanileo Halaaniani Halauoloolo Haleakala Halealii Halehuki Halemano Haleolono Halepaahao Halepaki Haloalena Haluluikekihiokamalama Hamakualoa Hanaaumoe Hanamaulu Hanapepe Hanualele Hauailiki Hauikalani Hawaiiakea Heakekoa Hekilikaakaa Hikapoloa Hilopaliku Himatione Hinaaikamalama Hinaaimalama Hinaakeahi Hinaikainalama Hinaikamalama Hinakahua Hinaluaikoa Hinaluaimoa Hinapaleaoana Hinauluohia Hinawaikolii Hiwahiwa Hoamakeikekula Hokiolele Hokuhookelewaa Holaniku Holoholoku Holualoa Honehone Honokaape Honokalahi Honokalani Honolahau Honolohau Honopuuwaiakua Honopuwai Honopuwaiakua Honouliuli Honuaula Hoohokukalani Hookaakaaikapakaakaua Hookamumu Hookeleiholo Hookeleipuna Hooleipalaoa Hoolilimanu Hoomakaukau Hualalai Huawaiakaula Huliamahi Hulihonua Hulumaniani Kaahualii Kaawaloa Kaawikiwiki Kaehaikiaholeha Kaelehuluhulu Kaeloikamalala Kaeloikamalama Kahakaauhae Kahakaekaea Kahakuikamoana Kahalaoaka Kahalaokolepuupuu Kahalaomapuana Kahalaopuna Kahalapmapuana Kahaookamoku Kahapaloa Kahaumana Kahauokapaka Kahawalea Kaheawai Kahekili Kahihikolo Kahikihonuakele Kahikikolo Kahikiku Kahikimoe Kahikinui Kahikiula Kahioamano Kahoiwai Kahoolawe Kahoupokane Kaialeale Kaihalulu Kaihuopalaai Kaiimamao Kaikamahine Kaikilani Kaikipaananea Kaikuahine Kaikunane Kailiokalauokekoa Kaipalaoa Kaipolohua Kaiwakaapu Kaiwilahilahi Kaiwiopele Kakaalaneo Kakahaekaea Kakakauhanui Kakalukaluokewa Kakuhihewa Kalaehina Kalaekini Kalaeloa Kalaepuni Kalahumoku Kalakaua Kalamaula Kalanialiiloa Kalaniamanuia Kalanikilo Kalanilonoakea Kalanimanuia Kalaniopuu Kalapana Kalapanakuioiomoa Kalaumeki Kalaupapa Kaleikini Kalelealuaka 
Kalenaihaleauau Kalewalo Kalokuna Kalonaikahailaau Kalopulepule Kaluapalena Kaluawilinae Kamaainau Kamaakamikioi Kamaakauluohia Kamahaina Kamahualele Kamaikaakui Kamakaaulani Kamakaiwa Kamakulua Kamalalawalu Kamalama Kamanuwai Kamapuaa Kamehamaha Kamehameha Kamelekapu Kamoamoa Kamohoalii Kamooinanea Kamooloa Kanaloakuaana Kaneapua Kaneaukai Kanehunamoku Kaneikamikioi Kanenaiau Kanepohihi Kaneulohia Kaneulupo Kanewahineikiaoha Kanikaea Kanikaniaula Kanikapiha Kanikawi Kanoakapa Kaohukolokaialea Kaoleioku Kaonohiokala Kapaahulani Kapahaelihonua Kapahielihonua Kapaihiahilani Kapakohana Kapalilua Kapapaapuhi Kapapaiakea Kapepeekauila Kapuaokaohelo Kapuaokaoheloai Kapuheeuanui Kapuhiokalaekini Kapukaihaoa Kapunohu Kapunokaoheloai Karolineninsel Kauaiapuni Kauakahialii Kauakuahine Kaukaalii Kaukaukamunolea Kaukihikamalama Kaulaailehua Kaulanaikipokii Kaulanapokii Kaululaau Kaumaielieli Kaumailiula Kaumakapili Kaumalumalu Kaunakahakai Kaunakakai Kaunalewa Kauwilanuimakehaikalani Kawahineokaliula Kawaihae Kawaipapa Kawalakii Kawaluna Kawaomaaukele Kawaunuiaola Kaweleau Keahumoa Keakahulilani Keakamilo Kealakaha Kealakekua Kealohikikaupea Kealohilani Keanapou Keaomelemele Keauleinakahi Keaulumoku Keawanui Keaweikekahialii Keawenuiaumi Keinohoomanawanui Kekaihawewe Kekalukaluokewa Kekalukaluokewaii Kekuhaupio Keleanuinohoonaapiapi Keliimalolo Keliiokaloa Keliiomakahanaloa Kenaloakuaana Kenntniss Keoneoio Kepakailiula Kepapaialeka Kihanuilulumoku Kihapiilani Kihawahine Kiimaluhaku Kikekaala Kilioopu Kilohana Kilokilo Kipahulu Kipapalaula Kipapalaulu Kipunuiaiakamau Koeniglichen Kohalalele Kohalaomapuana Koholalele Konikonia Kookoolau Koolauloa Koolaupoko Kosmogonie Kotzebue Kuaihelani Kuamooakane Kuapakaa Kuauamoa Kuhukulua Kuhuluhulumanu Kuikauweke Kuililoloa Kuilioloa Kukailani Kukamaulunuiakea Kukaniloko Kukaohialaka Kukeapua Kukuikiikii Kukuipahu Kukululaumania Kulanihakoi Kulukulua Kumukahi Kumukena Kumuniaiake Kumun uiaiake Kupaahulani Kupololiilialiimualoipo Kupololiilialiimuaoloipo Kupukupukehaiaiku Kupukupukehaikalani Kupuupuu Kuwahailo Laamaikahiki Laamaomao Lahainaluna Laieikawai Laielohelohe Lalakeenuiakane Lamaloloa Lanalananuiaimakua Laniihikapu Lanikahuliomealani Lanikuakaa Lanioaka Lanipipili Lapakahoe Laukapalala Laukapalili Laukiamanuikahiki Laukieleula Laupahoehoe Lepeamoa Liliuokalani Liluokalani Lolomauna Longapoa Lonoapii Lonoikamakahike Lonoikamakahiki Lonoikiaweawealoha Lonoikoualii Lonokaeho Lonopili Luahinekaikapu Luakalai Lulukaina Lupewale Maakuakeke Macculloch Mahealani Maheleana Mahinanuikonane Mahukona Maiauhaalenalenaupena Mailehaiwale Mailekaluhea Mailelaulii Mailepakaha Makahanaloa Makaukiu Makaulanei Makaweli Makeweli Makiioeoe Makuakane Malaekahana Malaiakalani Malamanui Malanaihaehae Malanaikuaheahea Malekahana Malelewaa Manaiakalani Maniniholokuaua Mantandua Marianen Marquesan Marquesas Maunakalika Maunakea Maunalahilahi Maunalei Maunaloa Maunauna Meeresweiten Melanesian Melanesien Melbourne Micronesian Mikronesien Moahelehaku Moanaikaiaiwe Moanaliha Moanalihaikawaokele Moananuiakea Moananuikalehua Moaulanuiakea Moerenhout Mokuekelekahiki Mokuhano Mokukeleikahiki Mokukelekahiki Mokuleia Molokini Moloklni Monographie Monowaikeoo Nakinowailua Nakolowailani Namakaokahai Namakaokalani Namoeluk Nanakuli Nathaniel Naulukohelewalewa Nihoalaki Niuhelewai Nuumealani Oioiapaiho Okipoepoe Olekulua Omaokamao Omaokamau Oneoneihonua Opelemoemoe Opukahonua Pahulumoa Pakaalana Palalahuakii Paleaikaahalanalana Palikaulu Paupauwela Peleioholani Pelekunu Petroglyphs Pihanakalani 
Piihonua Piimaiwaa Piimaiwae Pikoiakaala Pikoiakaalala Pioholowai Piokeanuenue Pleiades Pohakuokauai Poliahu Polomauna Pomaikai Puaakukui Puaamaumau Puaatihaloa Pueonuiokona Puniaiki Pupuakea Pupualenalena Pupuhuluena Puuanahulu Puukohala Puukohola Puumahawalea Puumaneo Puuoaoaka Puuonale Puuopapai Puupuukaamai Uhumakaikai Ukumehame Uweuwelekehau Waawaaikinaanao Waawaaikinaaupo Wahilani Waiahole Waiahulu Waialala Waiapuka Waihalau Waiohonu Waiolama Waiopuka Waiulaula Walewale Wawaikalani ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070702/dd394c62/attachment-0001.htm From Bowerbird at aol.com Mon Jul 2 15:09:19 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 2 Jul 2007 18:09:19 EDT Subject: [gutvol-d] LoCC and Subject fields Message-ID: i said: > it's easy to "be found" in cyberspace if you play your cards right... > > play 'em wrong -- by using some library of congress stuff which > none of your end-users is hooked into -- and you'll be invisible. > > (this is _not_ to say that that stuff couldn't be of _some_ use, but > you'd have to make sure the cost-benefit ratio justified the work. > if you really want to pursue _that_ angle, then find a way to lower > the costs -- and the best suggestion i have for you there is to dig > into the amazon a.p.i. and find out if you can scrape info there -- > and to raise the benefits -- where the best suggestion i have for > _that_ is to get project gutenberg's e-texts pointed to by amazon, > and if you manage that, _then_ you will have accomplished much.) since jason's desired target was _libraries_, i should've said "worldcat" instead of "amazon", but the basic idea is exactly the same, of course. i'm kinda surprised that no one else responded to jason. is it true that you've all just given up on the p.g. catalog as being helpless? -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070702/b00e99a0/attachment.htm From shabam.dp at gmail.com Mon Jul 2 15:39:25 2007 From: shabam.dp at gmail.com (shabam) Date: Mon, 2 Jul 2007 15:39:25 -0700 Subject: [gutvol-d] Fwd: Educated Earth Website / Donation to PG (fwd) In-Reply-To: <20070702035722.GA16513@mail.pglaf.org> References: <20070702035722.GA16513@mail.pglaf.org> Message-ID: <1ac896090707021539j4578a778n70d3fab635bb989c@mail.gmail.com> Greg, Well, with 74 of 111 votes going to PG, I think PG is in. Be sure to let us know how much PG "wins". I'd never heard of them before, but this is a good way for them to get their name out there. They do not guarantee any amount to PG, so it could be $5 that the raise, but they get the organizations to tell their "Staff and visitors" to visit the site and vote for them. Interesting content. Looks like they are one of the many sites that look for interesting content to post on their site, maybe having some original content on occasion. They do need to update their catalog though. A lot of the UTube stuff came back saying "This video is no longer available" Ah well. If they get PG $5, then that is $5 PG would not have had otherwise. Overall, I'm not too impressed with the site, as so many of the videos I tried to watch were no longer available. They need some way to clean up these. 
Jason On 7/1/07, Greg Newby wrote: > > Has anyone seen the www.educatedearth.net site in action? > (We sent Ben info about sending us money) > > >[ben at educatedearth.net - Sun Jul 01 13:03:31 2007]: > > > >Hio. My name is Ben Lovatt, I'm the owner of a humanitarian > >science/technology website called EducatedEarth ( > >http://www.EducatedEarth.net ). We raise money in donations (in addition > >to 10% of our profits) and give them to a different organization every > >month. To decide which organization should receive the money, we have our > >members give us suggestions on companies and we let viewers of our site > >vote on where to donate it. Project Gutenberg has been nominated and is > in > >this month's poll. You're welcome to encourage your staff and website > >visitors to vote for you. > > > >If your organization was to win, how would I send this money to you? > > > >Thanks, > >Ben Lovatt > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > -- Person to person lending. Lend money to others, and get a $25 bonus. http://www.prosper.com/join/shabam -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070702/e9a62d39/attachment.htm From Bowerbird at aol.com Mon Jul 2 17:09:39 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 2 Jul 2007 20:09:39 EDT Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #3 Message-ID: ok, to review, we've discussed the first two steps of digitization so far, namely (1) the scans (and the importance of doing them well), and (2) the o.c.r. (and the importance of using abby v8, the best program). there are many steps in the digitization process, and it's important to make sure that _none_ of them will become a weak link in the chain... after the o.c.r. is finished, step #3 in the recipe is the _o.c.r._cleanup._ in the lingo of distributed proofreaders, this is called "preprocessing"; (it's so-named because it's done before the text goes to the proofers, contrasted with "postprocessing", which is the d.p. step that happens after the text has gone through the proofing and formatting rounds.) it is in the preprocessing -- or, more accurately, the _absence_ of it, almost completely -- where d.p. reaches its full nadir of inefficiency. it doesn't take much intelligence to intuit that if you can correct errors in one fell swoop, that'll be more efficient that fixing 'em one at a time. and it is the preprocessing stage where you're fixing errors _globally_. here are some of the chores that i routinely do in "preprocessing": 1. do basic integrity checks, to find and fix missing or doubled pages 2. fix section headers, usually the text in and around chapter titles 3. fix frontmatter pages, which typically return inferior o.c.r. results 4. fix runheads and pagenumbers, for best navigational grounding 5. obtain and evaluate the "vocabulary" of the book -- the words in it these chores are ones that must be done _eventually_, and it is simply _more_efficient_ to do them sooner than later, so i do 'em right away... in other words, having these things be _correct_ from the very outset returns many beneficial aspects in your further handling of the text... i'm not gonna belabor the point, because i've already had to discuss it (both here and on the d.p. 
forums) many times more than it deserves -- it's common-sense that two numbering systems will be confusing -- but one of the most _maddening_ things that d.p. content providers _routinely_ fail to do is to name image-files using their pagenumber. that is, page 37 might be located in the file named "049.png", and then of course page 49 will be found in the file named "061.png"... this is ridiculous. it's best to name the file for page 37 as "037.png", so it's absolutely clear just by looking at its name what its content is. (it's also good to prepend the number with a string to make it unique -- every scan-set over at d.p. starts with "001.png" and goes up, so there's nothing to make the names unique. but forget that for now.) every time someone who wants to go to page 49 enters their "49", only to find out that they've ended up with the scan for page 37, so they must do the subtraction routine in their head to figure out that the offset is _12_, and thus for page 49 they must enter "61", that's some of their time and energy that was wasted unnecessarily. maybe you say it's not much, but when you multiply it by _multiple_ instances on every day of every week of every month of every year... so one of the first things i do with a d.p. scan-set is rename the files. when i want to see page 37, i want to enter "37" and be done with it. to do anything differently than that is to ignore a _basic_ efficiency... (and of course this helps you to quickly discover any missing pages, which -- as any experienced person knows -- are always a hassle.) also related to the vital matter of easy navigation around the text is the section-headers, which is why i clean those up right away... i like to be able to ground myself with an accurate table of contents. i also like the power of a "chapter menu" generated _automatically_, as well as the ability to "skim" the chapter headings, forward or back. for all of these things, consistent formatting of headers is required... i also find that the runheads and pagenumbers are an _essential_ element in providing "grounding" while i am working on the text... (which, thinking about it, is the very utility they provide to readers.) ironically, d.p. often strips away these runheads and pagenumbers. (even more ironically, they're considering a "meta-data" round where they will have volunteers _re-enter_ the pagenumbers they stripped! how's that for unbelievable stupidity?) as for the frontmatter pages, i often find that i have to shuffle some of them around, both for esthetics and to help in the image-naming (i.e., it's fine to delete blank pages to make numbers come out right.) *** but far and away the most important of the 5 things mentioned above is generating the "vocabulary" of the words that are used in the book. this is a straightforward task, and it's easy to program a tool to do it: 1. read in the o.c.r. text. 2. change spaces and tabs to line-ends, so every word is on its own line. 3. sort the lines. (use an ascii sort, so initial-cap words sort to the top.) 4. spellcheck the words, sorting them into piles that "pass" and "fail". 5. examine the initial-cap words, most of which will be names, and 6. move high-frequency ones and "looks right" ones to the "pass" pile. 7. examine the initial-lower words, most of which will be scannos, but 8. move high-frequency ones and "looks right" ones to the "pass" pile. the "pass" pile is the "vocabulary" for the book, and it should be used as the "dictionary" for all further spell-checking that you'll do on this book. 
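a rough sketch of that recipe in code -- with the "spellcheck" reduced to a lookup against a plain word list, the file names invented for illustration, and a simple frequency cutoff standing in for the hand inspection of steps 5 through 8:

```python
import re
from collections import Counter

def build_vocabulary(ocr_text, dictionary, auto_pass_count=3):
    # steps 1-2: split the raw o.c.r. into words
    counts = Counter(re.findall(r"[A-Za-z']+", ocr_text))
    passed, failed = set(), {}
    # step 3: ascii sort, so initial-cap words group together
    for word in sorted(counts):
        # step 4: spellcheck; frequent unknown words (mostly names)
        # are passed automatically, the rest go to the "fail" pile
        if word.lower() in dictionary or counts[word] >= auto_pass_count:
            passed.add(word)
        else:
            failed[word] = counts[word]
    return passed, failed

# hypothetical usage
dictionary = {w.strip().lower() for w in open("english_words.txt")}
vocab, suspects = build_vocabulary(open("book_ocr.txt").read(), dictionary)
print(len(vocab), "words pass;", len(suspects), "need a human look")
```

the "passed" set is the book's vocabulary; the "failed" dictionary, with its counts, is the list a correction tool can walk through.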
(you might not see the importance of this now, but do keep it in mind.) if you've gotten _reasonably_ good o.c.r. out of your scans, then you can generally even have the machine do this entire process _automatically_, by "passing" instances of a word that occurs 3 or more times in the book. (and remember, if you _haven't_ gotten "reasonably good" o.c.r., then you really need to go back and fix _that_ problem before you even proceed...) as for the "fail" pile, that's a real goldmine in disguise, as it allows you to zero in on the problems with a laser-like focus. i have a program that zooms me from one of these bad words to the next, pulling up the text (with the bad word pre-selected) _and_ the scan of the respective page, for a quick-and-easy check. when the "bad" word turns out to be correct, i click a button that (a) adds the word to the vocabulary, and (b) moves me to the next bad word. if the "bad" word is indeed wrong, i just do the edit. if the new word, as edited, is not in the dictionary, it will be added as well. this interface is _amazingly_fast_ at fixing mistakes. i mean, really fast... and all of this can -- and _should_ -- be done during preprocessing... if you want, you could fine-tune my system to ignore the bad words that occur only once in the book (and maybe even only twice or three times), on the assumption that it's just as "efficient" for the proofers who're doing each individual page to handle these infrequent bad words. but if it's me doing the proofing, i'd rather have the system direct me to the problems, and facilitate my handling of them, rather than make me locate each one, so then my "proofing pass" serves as a "verification check" on the change. and for those errors that pop up _repeatedly_, there is _no_question_ that it's more efficient to correct them on a _global_ basis than _individually_... sure enough, every once in a while you see some proofers observing that "it sure seems like it'd be a lot easier to make this change project-wide..." but they just get ignored, or patted on the head. notice that my system makes it very simple to add a word to the dictionary. (in the future, i'll make it just as easy to delete a word from the dictionary.) that brings up a very important point, namely that the book's vocabulary is constantly in flux, with words being added (and maybe deleted) from it, possibly right up until the very last word is proofed on the very last page... so one thing you need is a mechanism that lets you _check_ occurrences of a word, so that you can decide whether it was added _correctly_ or not. (whenever any word is added, it should be checked throughout the book.) *** there are lots and lots of other instances where preprocessing can help, over and above the situation of _correcting_words_ in a global manner. just to give some common cases: 1. delete any space before commas, semicolons, colons, and periods 2. adjust "spacey" quotemarks (ones with whitespace before and after) 3. change "1" inside a word to an "l" when doing so passes spellcheck 4. change "0" inside a word to an "o" when doing so passes spellcheck 5. change an "l" or an "o" to "1" or "0" if all other characters are numbers 6. lowercasing an uppercase "o" or "w" if it happens to occur mid-word 7. detecting and formatting section-headers in the appropriate manner 8. finding and fixing pagenumbers that were not recognized correctly 9. locating and correcting blank lines erroneously injected or deleted as indicated here, _punctuation_ in general is often an o.c.r. troublespot. 
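a few of those fixes are one-line regular expressions, and the digit-for-letter swaps only need the book's vocabulary from the step above. a hedged sketch (again, the "spellcheck" here is just a word-list lookup):

```python
import re

def preprocess(text, vocabulary):
    # 1. delete any space before commas, semicolons, colons, periods
    text = re.sub(r" +([,;:.])", r"\1", text)
    # 5. an "l" or "o" sitting between digits is almost always 1 or 0
    text = re.sub(r"(?<=\d)[lo](?=\d)",
                  lambda m: "1" if m.group(0) == "l" else "0", text)
    # 3-4. swap 1->l and 0->o inside a word, but only when the result
    #      passes the spellcheck against the book's vocabulary
    def fix_word(match):
        word = match.group(0)
        if word.isdigit():
            return word
        fixed = word.replace("1", "l").replace("0", "o")
        return fixed if fixed.lower() in vocabulary else word
    text = re.sub(r"\b\w*[10]\w*\b", fix_word, text)
    return text

# hypothetical usage
vocab = {"look", "onion", "mission"}
print(preprocess("l00k at the 0nion , and the m1ssion .", vocab))
# -> look at the onion, and the mission.
```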
(and, to be fair to the o.c.r., a good number of those excessive spaces are clearly present in the book, thanks to old-time typographical practices, so the o.c.r. results are actually accurate, they're just not what we now want.) so fixing punctuation glitches is one great things about preprocessing... also, since these changes are made _before_ the text goes to the proofers, you can usually can make 'em _automatically_ with only minimal checking, because they will be _verified_ by the proofers, who will catch any errors... this arena of preprocessing is _the_ one where d.p. is _most_ inefficient, so i ended up pointing out instances of it over and over and over again on their forums. it was one of the easiest fish to shoot in that barrel, since it comes up _constantly_. indeed, here are some posts from their messages boards _today_ where the lack of adequate preprocessing has reared its ugly head: > unflagging common player names in spaulding guide to baseball > http://www.pgdp.net/phpBB2/viewtopic.php?p=331080#331080 > > invisible utf8 characters in the code (could be deleted globally) > http://www.pgdp.net/phpBB2/viewtopic.php?p=341329#341329 > > tables (it's rather easy to handle table formatting programmatically) > http://www.pgdp.net/phpBB2/viewtopic.php?p=341302#341302 > > blank line(s) at the top of a section header causing bad "diffs" > http://www.pgdp.net/phpBB2/viewtopic.php?p=341183#341183 again, all of these examples were obtained from posts made _today_. and that's not unusual, because preprocessing ramifies in a big way... i could go on and on giving more examples, but i think it's clear now. any time you can get the machine to fix an error instead of a human, you're going to improve your efficiency. and your volunteer retention. and i'm not gonna pick on the d.p. "wordcheck" -- the new version of its spellchecker that enables people to add words to the dictionary -- because they're still feeling their way with it; but it needs improvement. hopefully, they will be able to figure out how to fix it. (i told 'em, but...) it's also interesting that -- now that they've "exiled" me -- they can act like it was all their idea to improve the preprocessing that they do, and thus go to work on that. indeed, that's just what they have done. (without me there saying "that's what i was telling you to do all along".) let's hope their efforts in this regard don't bog down, because this is an arena where they waste _far_ too much volunteer time and energy. ok, so the workflow so far: 1. get good scans. (crop and deskew, and test despeckling.) 2. get good o.c.r. (do a number of tests to get the best output.) 3. do preprocessing. (changes made globally are most efficient.) more later... -bowerbird p.s. i _will_ point out that, in order to get the most out of preprocessing, it's absolutely necessary that you create an interactive correction tool, and the people over at d.p. don't seem to realize that yet. hopefully they will, especially since it's not at all difficult to program such a correction tool... ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070702/314381eb/attachment-0001.htm From ricardofdiogo at gmail.com Mon Jul 2 17:39:47 2007 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Tue, 3 Jul 2007 01:39:47 +0100 Subject: [gutvol-d] ok, let's talk about d.p. 
efficiency, item #3 In-Reply-To: References: Message-ID: <9c6138c50707021739h50e795b9v3c34e137304997e9@mail.gmail.com> 2007/7/3, Bowerbird at aol.com : > this is a straightforward task, and it's easy to program a tool to do it: (...) > it's not at all difficult to program such a correction You may find some inspiration by reading Project Gutenberg's FAQ http://www.gutenberg.org/wiki/Gutenberg:Tools_FAQ#What_programs_could_I_write_to_help_with_PG_work.3F > it's also interesting that -- now that they [DP]'ve "exiled" me -- they can > act like it was all their idea to improve the preprocessing that they do, > and thus go to work on that. indeed, that's just what they have done. > (without me there saying "that's what i was telling you to do all along".) ROFL. (sorry). From Frank.vanDrogen at bc.biol.ethz.ch Mon Jul 2 22:09:52 2007 From: Frank.vanDrogen at bc.biol.ethz.ch (Frank van Drogen) Date: Tue, 03 Jul 2007 07:09:52 +0200 Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #3 In-Reply-To: References: Message-ID: >(2) the o.c.r. (and the importance of using abby v8, the best program). Actually, Finereader 7.0 does much better then 8, on many aspects. Frank From Bowerbird at aol.com Mon Jul 2 23:21:15 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 3 Jul 2007 02:21:15 EDT Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #1 Message-ID: juliet said: > on many points well, well. one thursday juliet is _banning_ me from discussions on her site, and the very next thursday she is joining a conversation i've initiated elsewhere. it's a good thing i am a huge fan of _irony_, isn't it? > on many points bowerbird is just uninformed or > making unfounded assumptions and accusations. that's a strong charge. let's see your documentation of it. i've spent a lot of time on your forums getting "informed", and my "assumptions and accusations" are _well_ founded and strongly grounded. and i'll be happy to relate a _ton_ of examples for any point that you believe is questionable just let me know, and prepare yourself for a big deluge... in the meantime, drop the ad hominem tactics, please. and if you can't respond to the topic directly, stay silent. > We do strongly encourage deskewed, > reasonably well cropped, decent scans. then how come you don't get more of them? why do so many of your scan-sets look crappy? i invite _anyone_ to step over to d.p. and look at their archived scan-sets: > http://www.pgdp.org/ols/ just pick a dozen, at random, and look at some sample pages. maybe you'll agree with juliet that they are all "decent" enough. or maybe you'll agree with me that _many_ of them fall short... i'm not telling _you_ what to think. i'll telling you what _i_ think. also, for those who do not realize the importance of nice scans, i strongly encourage you to download from the internet archive _any_ of the 400 books that were digitized by nicholas hodson. search t.i.a. for "nick hodson" or "athelstane" to find his work, or: > http://www.athelstane.co.uk/ nicholas is one of the most prolific book-digitizers on the planet, having done every aspect of each of those 400 books by himself. he creates a text file and an .html version, as well as a groomed text-to-speech version, and bundles the scan-set into a .pdf, so you can download that and see for yourself what a _respectable_ scan-set looks like -- deskewed and cropped with a nice margin. 
thumb through the scan-set and see how pleasant the esthetic is when the text-block is in a consistent location on the page-scan. once you've observed how it should be done, you know it's right. i downloaded the scan-set .pdf from this hodson book just now: > http://www.archive.org/details/Harry_Collingwood_A_Pirate_of_the_Caribbees and reminded myself just how delightful it is when it's done right. even nicholas isn't perfect, as i found a glitch on page 329!, but even his worst example is as good as the best you'll find from d.p. > When there have been content providers who are not doing > a minimally acceptable job at that, they have gotten notes from > me or someone else with authority and experience in scanning. well, i can believe that is true. at the same time, however, the fact still remains that some d.p. scan-sets are awful... here's that one that i pointed to a while back: > http://www.z-m-l.com/go/ortenmc/p123.html (as usual, you can adjust the number for other pages.) and here's the o.c.r. that came out of page 123: > http://www.z-m-l.com/go/ortenmc/p0-123.txt (and yes, i do believe it's possible to get better o.c.r. out of those scans, even though they're quite lousy.) yet, to see what a heroic job was done by the proofers: > http://www.z-m-l.com/go/ortenmc/p1-123.txt that's a book going through your system _now_, juliet... if that's a "minimally acceptable" job, then bob's my aunt. and -- just to remind people again -- i didn't pick it out. that book was suggested _to_ me, as a project _for_ me... i have looked at a _lot_ of d.p. scan-sets, and i witness some really crappy scans, and some really crappy o.c.r. scans that if i did 'em myself, i'd consider unacceptable. o.c.r. that if i did it myself, i'd _force_ myself to improve. i would simply be too embarrassed to put that to others who are expected to "proof" it... and, even with all of the many d.p. scan-sets i have seen, to be fully honest, i am only on the _rarest_ of occasions actually _impressed_ by the scanning job that was done... some d.p. scan-sets _are_ clean, to be sure. here's one: > http://www.z-m-l.com/go/goann/goannp123.html (again, adjust the zero-padded number for another page.) but even this very-clean scan-set has not been _cropped_. (click through the pages and see how the upper-left corner bobs and weaves from one place to another as you proceed. compare that to the rock-solid appearance of a hodson .pdf.) and again, let me repeat that it is a _batch_ process to deskew and crop a set of scans. it takes 5-10 minutes to set things up, but then you just click one button to transform _all_ the scans... and yet it makes a huge difference in the quality of the results. > Same for overly large scans, missing pages, etc. i didn't mention the issue of missing pages because i know you've _finally_ come to see what a terrible time-sink that they can be... but they still happen with a frequency that is too high; i detected a missing page in a book a month ago. so it still happens. plus: > http://www.pgdp.net/phpBB2/viewtopic.php?p=341491#341491 now, if you just realized that bad scans are also a huge time-drain... > As someone else pointed out, Finereader automatically > deskews pages unless they are extremely badly skewed. and as i pointed out, in response, this doesn't help the _proofers_ who have to _examine_ those crooked scans, day in and day out... nor does it help all the end-readers who will look at them too... and -- once again -- these are _batch_ operations. why fight 'em? 
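for what it's worth, the batch step doesn't even need a g.u.i. -- a rough sketch driving imagemagick's convert from python (the directory names and the 40% deskew threshold are just illustrative, and you'd want to eyeball a few pages before trusting -trim on a whole set):

```python
import glob, os, subprocess

os.makedirs("cropped", exist_ok=True)
for scan in sorted(glob.glob("scans/*.png")):
    out = os.path.join("cropped", os.path.basename(scan))
    subprocess.run(
        ["convert", scan,
         "-deskew", "40%",                             # straighten small rotations
         "-fuzz", "10%", "-trim", "+repage",           # crop to the text block
         "-bordercolor", "white", "-border", "40x40",  # put back a uniform margin
         out],
        check=True)
```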
teach your providers how _easy_ they are, and show 'em how much time and energy it saves (not to mention creating a nicer product), and then you won't even need to "require" them to do that for you. > Most content providers do draw text blocks for recognnition i heartily recommend you use spell-check on your messages... typos reflect badly on you in your position... > Most content providers do draw text blocks for recognnition > where the OCR doesn't get it right. this points out one of the most ironic twists in this little tea-kettle: cropping the scans such that the text-block is consistently located means that the "blocks" in the o.c.r. program work _much_ better, and can be drawn _only_one_time_ yet work for the _entire_book_. which means that cropping also saves time of the person scanning! not just the proofers down the line, but the scanner-person directly! still, look for yourself and see that _almost_no_d.p._scan-sets_ have been cropped consistently. don't take my word for it. or juliet's word. go look for yourself, and you will see that i'm accurate and she is not... > http://www.pgdp.org/ols/ go ahead. i'll wait. really. just pick a few books at random. indeed, if you can find _one_ scan-set that's cropped, say so! or, if you prefer to see a book that's _currently_ in the system, go to the "current projects" thread found in the d.p. forums: > http://www.pgdp.net/phpBB2/viewforum.php?f=2 again, pick out _any_ book at random, go to its forum thread and -- right there in the first message -- click on the link to go to the "project comments", where you'll click on "detail level" number 4, which gives a page of links to the scans in the project. view some. again, my bet is that you won't find one scan-set that is cropped... > Again, the worst offenders will here from me stealth scanno alert: _hear_ from me... > Again, the worst offenders will here from me > once the matter is brought to my attention. again, i don't know how you define "worst offenders"... but i can point to plenty of scan-sets i think are so bad i think it's unwise to subject them to volunteer proofers. > We expect all content providers to do some pre-processing nice dodge. but the "some" preprocessing you expect is not enough. and in the past, it seems to my eyes you did very little preprocessing. indeed, in many cases, i didn't seem to be able to see any done at all. (or, to put it more precisely, i saw evidence that you did _not_ do any of the preprocessing that i would consider to be blatantly necessary.) > Probably not as much as bowerbird would advocate, it has absolutely nothing to do what "how much" that _i_ "advocate". it's about how much it's _efficient_ to do. if an hour of preprocessing saves three or four hours of work later down the line, then you do it, and you don't even have to think twice about making that decision... but you guys haven't done enough preprocessing, or even enough _testing_ of preprocessing, to have _the_slightest_idea_ on how much time it can save down the line. indeed, you underestimate it _wildly_... i _know_ this, for a fact, because i have done those tests, carefully. (not that you have to _do_ the tests "carefully", because the results are _so_ striking the outcome is immediately obvious at the outset.) as for my so-called "lack of experience", i can tell you that i have a _lot_ of experience scanning and digitizing. 
and i was working on projects where i was paid a _flat_fee_, so it was in my direct self-interest to know _exactly_ the most efficient way of going about the entire process, and i learned very fast that any changes i could make _globally_ were golden. if my workflow was as inefficient as the one at distributed proofreaders, i would've been working for minimum wage. as it was, i made out big... > but again, but certainly we do far more than "nothing". you do closer to "nothing" than to what i've found to be efficient. > Some things might be a good idea, but in practice > require more effort than I'm willing to do. i'm certainly not suggesting that a preprocessor spend two hours to save the proofers one hour. that would be a bad use of time... not even suggesting they spend two hours to save two hours, as that would be a wash. (you can change "hours" to "minutes" if you prefer, although it's easy enough to see the equation is identical.) i'm not even suggesting that a preprocessor should spend 2 hours in order to save the proofers 3 hours, since you have more proofers. (although that starts to approach the point where it's questionable.) but if the preprocessor is unwilling to spend an hour of their time -- because it "requires more effort than i'm willing to do" -- and it _costs_ the proofers down the line a full _4_ hours of time, that is a flagrant abuse of the time and energy being volunteered to you. in light of those donations, it's morally wrong to allow a disparity in the amount of work that one volunteer can displace to another. and if your system allows _big_ displacements, it needs to be fixed. > Remember, again, that the content providers are all volunteers. i know that very well. did you really think i'd "forgotten" it? c'mon... it's cases where a content producer is unwilling to spend 2 hours to save the later proofers 4 hours that make your workflow inefficient. would preprocessors like it if the proofers laid 4 hours of work on _them_ to save 2 proofer hours? i'm quite sure the answer is "no". also... it might help to tell people here that there is _one_person_ at d.p. who's been testing what i've been saying on a big project over there. (his handle is dkretz, and the project is the encyclopedia brittanica.) he has _consistently_ been reporting that his results are _excellent_, and thus he highly recommends preprocessing as being worthwhile. (and he hasn't even experienced some of the deep benefits thus far.) implement my suggestions. (even if you claim them as your own.) once you do, you'll find out that i was giving you excellent advice... and then maybe you'll stop making the ad hominem attacks on me. (and, as a final reminder on this topic, i do _not_ advocate that it be the content provider who does this preprocessing work. it could be _anyone_. indeed, i believe this should be a specific _role_, just like the "postprocessor" is now. and, for the record, i suggest it's best for one person to fill both the preprocessor and the postprocessor roles.) > the majority of our content providers use Abbyy Finereader. but there are still _many_ books in the system from other programs, and i'd guess that a non-insubstantial number of the abbyy ones are not from abbyy v8, even at this late date. (do you keep track of that?) let it be fully known that i _do_ appreciate that you have come to know the importance of using the best program out there, and now advise it. (although it is equally important to use the _best_version_ of it as well.) 
but i'm making a point here to the people who might not know it yet, and -- because you once did many books with inferior o.c.r. apps -- there is a large degree of inefficiency in your system due to that fact, an inefficiency for which your _volunteer_proofers_ are paying a price. understand i'm not _blaming_ you for the inefficiencies of the past, even those that are still lagging into the present-day work on-site. but _neither_ am i willing to forget about present-day texts entirely, and the inefficiency they still represent, because you have changed your policy since. there's still a huge amount of lingering inefficiency. > The ones who don't use it, typically can't afford to buy it ok, surely you aren't trying to imply that i say they should buy it. however, by the same token, you're also not trying to imply that -- just because person x can't afford to buy the right program -- therefore person x has the right to inflict poor o.c.r. on proofers. are you? if someone can't do the task right, for want of the proper tools or for any other reason, then they should pick some other task. > The ones who don't use it, typically can't afford to buy it > or already have another good OCR program. what "another good o.c.r. program" would that be? because i'd like to run some tests to compare them. > I did approach Abbyy several years ago about various issues > relating to Finereader and DP's experience with it. But I got nowhere. then tell people that. so they know you have gone to bat for them. and _try_again_. and again and again. show me you really care, and _i_ will go and bug 'em. and you now how irritating i can be. or get someone else to intercede if you can't get the job done... because abbyy _is_ doing deals. they're collaborating with _many_ institutions. so i don't see why they wouldn't work with p.g., since it's been the premiere free cyberspace library from the get-go, right? > The only "old-book" version that I know about is one that > OCR's black letter and fraktur texts. at their home page, click "products", then click "o.c.r. software", for: > ABBYY FineReader XIX > The first omnifont OCR software for Fraktur and old European > language recognition. It is specially designed for converting > ancient documents and books printed in 18-20th centuries > into digital text. It combines all of the power of > ABBYY FineReader Corporate Edition with special capabilities > for reading old European languages. sounds like the ticket to me. > The only "old-book" version that I know about is one that > OCR's black letter and fraktur texts. The pricing on it is > over a thousand dollars, requiring both the purchase of > the OCR package and then upfront purchase of the use of it > on a fixed number of pages. It also only works with Windows, > as far as I know, so we wouldn't be able to run it on our LINUX server. > If the pricing has come down significantly (which I hope it will) > or there is some other package that I'm not aware of, I always > appreciate having these things pointed out. check out their site. talk to them. work something out. you can do it. but _at_least_ test it out, so you can see whether it's worth the money. and if it _is_, then purchase it, if that's what you have to do. you have the money, and your volunteers will donate more if they see it works... > All of this has been said in the DP forums, several times. but you haven't checked out the abbyy site, or talked to them, or worked something out. show you care, juliet. you can do it. 
*** in closing, i'm _glad_ you're finally starting to see the light on how preprocessing can make your system more efficient. that's good... (and i'm proud to have served in helping you to get that realization, even if it means you choose to dislike me because i did. it's worth it.) at the same time, this doesn't excuse the inefficiencies that you have foisted upon innocent _volunteers_ over the course of _years_ now... and it surely won't bring back the people who have already left. -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070703/fe63c726/attachment-0001.htm From Bowerbird at aol.com Mon Jul 2 23:33:51 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 3 Jul 2007 02:33:51 EDT Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #3 Message-ID: frank said: > Actually, Finereader 7.0 does much better then 8, on many aspects. i've heard that on some occasions. (i've also heard v7 is faster.) but i trust nicholas hodson's experience on this issue most of all. he had about _400_ auto-corrections he'd make to his v7 output. (can't remember the exact numbers, but i do remember the ratio.) his testing showed that with v8, however, only 200 were necessary. i'm open to contrary results from tests that are performed rigorously. (jose menendez, for example, reports "almost perfect" o.c.r. out of his 1998 version of textbridge, and you can't argue with near-perfection.) but in the absence of such tests and such results, i believe nicholas... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070703/03a034ce/attachment.htm From shabam.dp at gmail.com Tue Jul 3 09:20:29 2007 From: shabam.dp at gmail.com (shabam) Date: Tue, 3 Jul 2007 09:20:29 -0700 Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #3 In-Reply-To: References: Message-ID: <1ac896090707030920i1890f49dg5a44e47b3cc3805@mail.gmail.com> > Actually, Finereader 7.0 does much better then 8, on many aspects. Not only that, but 7 is much cheaper than 8, (get it on eBay) and for those of us without an endless supply of cash, this is important. 6 works very well also. Besides, better OCR does not mean better product and getting it done faster. It only means that the round that runs the fastest gets less work to do, and a single person gets more responsibility. The type in projects I have run (a few black-letter and a couple handwritten manuscripts) run just as fast as similar projects that have good OCR. The type of project has a lot more to do with it than the quality of the OCR. "Alice's Adventures Underground" was a handwritten manuscript, and was typed in in P1. It ran through the rounds very quickly. My other children's books run just as quick. However "Assemble of Goddes" is blackletter poetry in middle English. Not very popular. It runs very slowly, and ran at the same speed as a blackletter, middle English, poetry that was typed in prior to P1. After they are done, the type-ins are just as high a quality as OCR'd projects. If a person is doing the project themselves, then the OCR is a lot more important. The idea behind DP is DISTRIBUTION. 
Having a bunch of pre-processing done by a single person, so that a group of people can work less, defies the idea of distributing the work. There has been some talk of distributing pre-processing, and some people do this by using the OCR pool (because they do not own an OCR program) or using scans someone else scanned (harvesting), but for the most part, this is all done by a single person. As a CP/PM I have to weigh the tradeoffs of higher quality preprocessing and more time to do other stuff. Doing stuff with my family wins. That said, I do provide a fairly high quality product, as do most of the CPs, and most PMs will double check them as well. I do think that page scans should be cropped (not too tight, but no 3 inch margins either), the project should be checked for missing pages (we have been talking about distributing this for years), and the scans should be readable. In some cases (like a picture book I just released) this means using grayscale or color images. The project should also be run through the preprocessing program we have (guiprep). Much beyond this, and the CP's and PM's time could be better spent in areas that need more people (the third proofing round and 2nd formatting round or post-processing and verification). Yes, DP is imperfect. So is everything else in life. We know this, but we are all volunteers. There are no paid staff to spend their lives making DP better, so we get improvements when the hardworking volunteer staff has time to take away from their families, work, social life, or what have you. We have been talking about these improvements for a long time (since the beginning of DP), and will continue to talk about these improvements. They come in spurts, and some of the best never get implemented (lack of money/programmer time/agreement). DP does a great job, and PG could not have as many high quality books as they do without the help of DP, even as imperfect as it is. Not everyone will agree that DP is a great thing. Some people will always have negative thoughts about us. That is their right. They often have incorrect information and think they are right, and don't realize that other people need to agree and then the time needs to be taken to do it. Perhaps if, instead of complaining, these people would spend some time helping to improve things, some of these things would get done, but until we have endless man hours and money and a big enough stick to convince enough people that our way is right, and theirs is not, some things will never get done. Any chance of talking about something meaningful? Jason -- Person to person lending. Lend money to others, and get a $25 bonus. http://www.prosper.com/join/shabam -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070703/b67d1f3d/attachment.htm From Bowerbird at aol.com Tue Jul 3 09:39:28 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 3 Jul 2007 12:39:28 EDT Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #3 Message-ID: jason's post is an excellent example of the confused thinking over at d.p. i won't bother responding to it for now, but i strongly encourage all of you to read it, and _examine_ it closely, to see if you can tell why it's off-base... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070703/7e748ca8/attachment.htm From Bowerbird at aol.com Tue Jul 3 13:55:11 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 3 Jul 2007 16:55:11 EDT Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #4 Message-ID: ok, so the workflow so far: 1. get good scans. (crop and deskew, and test despeckling.) 2. get good o.c.r. (do a number of tests to get the best output.) 3. do preprocessing. (changes made globally are most efficient.) item #4 in this series is the _proofing_ itself. in my model, most of the "proofing" was done in step #3, where we have our _tools_ zoom us to the "bad words", i.e., infrequent words which had not passed spellcheck, and other aspects of the text that seem to be anomalies (e.g., punctuation irregularities and trackable glitches)... i have argued before that this laser-like focus on _errors_ is sufficient to take the error-level down to the state where any further "proofing" that is necessary can (and _should_) be done by end-users who will read the book for _content_. (this includes stealth scannos, publisher errors, and so on.) i call this "final round" of corrections by readers of a book "continuous proofreading", and conceive of it as a lengthy (e.g., 6-month) period where the book is _only_ available to the general public in the text-next-to-its-scan format, which communicates the proofing expectation to people. (those who strongly want to read the book will still do so, and they're precisely the ones who'll be the best proofers.) only after this 6-month period will a text be released fully. i've said repeatedly an error-rate of 1-error-per-10-pages is good enough for a book to go into continuous proofing, and my research shows that this rate is easily obtainable with the error-focusing tools that i have programmed, so i am certain my workflow will prove itself in the real world. just to make it _clear_, though, let me say it quite directly. i strongly believe there is no need to proof every word on every page against the original scan. if the scans are clean, and you did a good job of doing the o.c.r., and you were conscientious about careful use of a tool to clean the o.c.r., your "continuous proofers" who are reading for _content_ will move your e-text to a state that approaches perfection. now, as i said, i'm _absolutely_convinced_ this is the case... but i couldn't blame you if you are skeptical of my claims... after all, "proofing against the scan" methodology has been the primary means of doing this job for a very long time now. so even though i think that it is _wildly_inefficient_ to proof every word on every page against the scan, that's not why i say that _distributed_proofreaders_ is inefficient. even if you accept their working assumption that this is the way it _must_ be done, there's still a lot of inefficiency within their workflow. so, for the rest of this post, i will deal with that d.p. workflow. (even though -- for my purposes -- it's completely outdated.) i have said _many_times_ -- both here and elsewhere -- that the major inefficiency of the d.p. proofing workflow is that it assumes _every_ page in every book needs the same amount of attention, as evidenced by the fact that all pages in a book are subjected to the same number of "rounds". it seems to me that that belief is _patently_absurd_, and that some pages are clearly more difficult than others, and thus need more rounds. 
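(just to make "more difficult" concrete: here's a hedged perl sketch of the kind of per-page scoring a preprocessing tool might do -- count the words on each page that fail a wordlist lookup, plus a couple of punctuation oddities, and sort. the p*.txt file names and the wordlist location are assumptions for the example only, not how any existing d.p. tool works.)

#!/usr/bin/perl
# rank-pages.pl -- hypothetical sketch: score each o.c.r.'d page by its
# number of "suspect" items (words missing from a wordlist, plus a couple
# of punctuation oddities) so the hardest pages stand out.
# Assumes one text file per page (p*.txt) and a plain wordlist at
# /usr/share/dict/words -- both just illustrative choices.
use strict;
use warnings;

my %dict;
open my $wl, '<', '/usr/share/dict/words' or die "no wordlist: $!";
while (<$wl>) { chomp; $dict{lc $_} = 1 }
close $wl;

my %score;
for my $page (glob 'p*.txt') {
    open my $fh, '<', $page or die "$page: $!";
    my $text = do { local $/; <$fh> };
    close $fh;
    $score{$page} = 0;
    for my $word ($text =~ /[A-Za-z']+/g) {
        $score{$page}++ unless $dict{lc $word};
    }
    $score{$page}++ while $text =~ /,,|;;| \./g;   # crude punctuation checks
}
for my $page (sort { $score{$b} <=> $score{$a} } keys %score) {
    printf "%-14s %5d suspect items\n", $page, $score{$page};
}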
the answer, of course, is a _roundless_system_, one that gives an individual page as many "rounds" as it needs, to be finished. how do we know how many rounds that is? the answer is simple: when a certain number of people have looked at a specific page and found "no corrections required" (n.c.r.), we can call it done... you can set the "certain number" at any level you like. i think 2 is enough, but you could make it 3 or even 4 if you wanted. but once that criterion number have given a page the "n.c.r.", you call it "finished" and move on. it's important to note that there might _still_ be an error on that page, but when there is, we'll expect the "continuous proofreading" process will find it. (so as long as the errors are under-1-in-10-pages, we're fine.) also note that it's dirt-simple to test whether any changes were made to a page, simply by whether the "before" equals the "after"; so we don't need to get into any complex ways to find the "diffs". so that's all i'm gonna say about "roundless" in this message now. d.p. has said it intends to move to a roundless system "eventually", and when they do, they'll reduce their inefficiency in a major way, so i dearly hope that that will happen _sooner_ rather than _later_. because the current system of sending too many pages through too many rounds is a big waste of the proofers time and energy. *** but -- since part of the purpose of this series is to document the inefficiencies in the distributed proofreaders workflow -- i must turn an eye toward the proofing rounds as they currently exist... first and foremost among the problems, of course, is the horrid spellchecker that was in use for many _years_ up until recently, which i documented fully in the message that i posted yesterday entitled "i decided to think about the future, not the past"... i won't repeat that here, except to say that i'm _greatly_ relieved they have plugged that hole. it's impossible for me to ignore the pain and suffering proofers were subjected to for so many years -- most especially those proofers in p2 who were _required_ to use that inferior tool, instead of being given a much better one -- so it'll take a long while before that bruise heals completely, but at least the bleeding is now stopped, and i'm appreciative for that. to recap, i started my constructive criticism of d.p. inefficiency in late 2003, which makes it some three-and-a-half years back, so i'm glad the last three-and-a-half _months_ have seen progress. but, you know, it's a little bit late, in my opinion. ok, a _lot_ late. for those who might be curious, here's the u.r.l. taking you to my christmas 2003 "constructive criticism" thread on the d.p. forums: > http://www.pgdp.net/phpBB2/viewtopic.php?t=5963 you will find that i laid out most of these points a long time ago... *** at any rate, here is a brief run-through of my remaining comments on the state of the _proofing_ interface of distributed proofreaders. i think there needs to be a better way of flagging possible errors. the wordcheck screen is ok, but it has to be summoned separately. in addition, the fact that wordcheck flags _all_ punctuation, and not just the punctuation that's probably wrong, is very bad overflagging. next, there needs to be a system of automated diffs, so that proofers are given quick feedback on every single page that they have done... (by "automated" diffs, i mean that the earlier proofer is notified by an automatic process that the diff is available for their inspection.) 
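(to show how little machinery that takes, here's a toy perl sketch of the before-equals-after test doing double duty -- counting n.c.r. passes for the roundless rule, and spotting exactly the passes where a diff exists and the earlier proofer could be notified. the threshold of 2 and the sample page history are invented for the example; a real system would pull them from its database.)

#!/usr/bin/perl
# ncr.pl -- toy sketch of the roundless rule described above: a page is
# finished once enough consecutive passes come back with "before" equal
# to "after"; any pass that does change the text is exactly the case
# where a diff exists and the earlier proofer could be notified.
# The threshold and the sample page history are invented for the example.
use strict;
use warnings;

my $NCR_NEEDED = 2;    # consecutive no-change passes required

sub page_is_done {
    my @versions = @_;              # oldest first: o.c.r., then each pass
    my $ncr = 0;
    for my $i (1 .. $#versions) {
        if ($versions[$i] eq $versions[$i - 1]) {
            $ncr++;                 # no corrections required this pass
        } else {
            $ncr = 0;               # a change: a diff exists for feedback
        }
    }
    return $ncr >= $NCR_NEEDED;
}

# four versions: raw o.c.r., one correcting pass, then two clean passes
my @page = ("Tbe quick brown fox.\n",
            "The quick brown fox.\n",
            "The quick brown fox.\n",
            "The quick brown fox.\n");
print page_is_done(@page) ? "page finished\n" : "needs another pass\n";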
in addition to the training aspects, a benefit of automated feedback is that an automated diff lets the earlier proofer _verify_ work done by the _later_ proofer, so if the later proofer introduces any errors, the earlier proofer can draw attention to the goof. very important... i believe it's the case that there are efforts underway (but again, initiated in the last 3.5 months rather than the last 3.5 years) to bring about these improvements, and i welcome that "initiative". i'm sure there are some other things i'm forgetting right now -- it's been a long time since i proofed my pages over at d.p. -- but that's my general wrap-up on the suggestions i would make. in addition, i will say that there are a bunch of nice features on the current interface, including a wide variety of pop-up menus that help proofers with difficult things like greek characters, and so on. the interface also gives people a choice between "horizontal" and "vertical" display, which is a nice option that i don't give people, so score one exclusively for the d.p. side in that regard. and once the wordcheck capability stabilizes on some solid ground, there's a good chance people will figure out how to make it perform all kinds of useful additional functions that will help out proofers... but the development i am most keen on, in regard to the interface for proofers, is the development effort of dkretz, who is designing a new interface to be used in the upcoming "dp50" offshoot of d.p. while i believe that -- in most regards -- it will be very similar to the current interface, at least in terms of existing functionalities, dkretz has focused on building in a wide variety of error-flagging, and that holds the potential for greatly facilitating the proofing... also, another wrinkle could potentially be extremely fascinating, namely that dkretz is planning on using a form of "light" markup, rather than the kludgy markup that has evolved on the main site... this "light" markup falls mainly into the category of "formatting" -- at least as the distinction is currently made on the main site -- but since "light" markup doesn't _obfuscate_ the text (which was the primary reason that formatting was split off from proofing), my sense is that the proofers will end up doing all the formatting -- helped along by the ability to have that formatting displayed, in a direct-feedback way that lets them quickly do corrections -- and there will be no _need_ for a formatting round, let alone _2_, which will create a tremendously more efficient overall workflow. indeed, it seems to me that even the _postprocessing_ role will also be greatly diminished in importance, because the proofers will be doing more and more of the tasks associated with that... but i'll handle that more directly in my _next_ post in this series, which will focus on _formatting_, the next big step in our recipe. but that probably won't be until next week, so enjoy the holiday! -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070703/22a4979f/attachment.htm From Bowerbird at aol.com Wed Jul 4 16:30:58 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 4 Jul 2007 19:30:58 EDT Subject: [gutvol-d] independence day Message-ID: there's this anonymous person who writes a blog that purports to be "the secret diary of steve jobs". 
fake steve has become one of _the_ most-read blogs in cyberspace, thanks to entries like this one today: > http://fakesteve.blogspot.com/2007/07/music-industry-nobs-have-finally.html here's the heart of that post: > The music companies are in a dying business, and they know it. > Sure, they act all cool because they hang around with rock stars. > But beneath all the glamour these guys are actually operating > two very low-tech businesses. One is a form of loan-sharking: > they put up money to make records, then force recording artists > to pay the money back with exorbitant interest. The other business > is distribution. They've got big warehouses and they control the > shipment of little plastic boxes that happen to have music in them. > The guys running the labels are pretty stupid -- most are just dirtbags > who started out as band managers or promoters -- but now at long last > they are kinda sorta finally vaguely getting clued in to the fact that > both parts of their business model are fucked. boom. i suppose it's rather easy to see how this extrapolates to _books_. all in all, it's a classic fake-steve entry, especially for independence day... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070704/e3d3858b/attachment.htm From tb at baechler.net Fri Jul 6 05:41:40 2007 From: tb at baechler.net (Tony Baechler) Date: Fri, 6 Jul 2007 05:41:40 -0700 Subject: [gutvol-d] Hugh McGuire - Jon Udell's Interviews With Innovators Message-ID: <20070706124140.GC10531@investigative.net> Hello all, This is slightly off topic but might be of interest to some of you as LibriVox uses PG files as their source material if I understand correctly. Anyway, here is an interview with the founder of LibriVox. From: "GigaVox Media (All Channels)" Audiobooks are an excellent way to make books available to everyone. When Hugh McGuire founded LibriVox in 2005, he wanted to take advantage of the masses of book lovers across the world to record and make available a catalog of audiobooks. On this week's Interviews with Innovators, Jon Udell speaks with McGuire about the origins, growth and distinctive architecture behind LibriVox. URL: http://feeds.gigavox.com/~r/gigavox/network/~3/110500128/detail1783.html Enclosure: http://feeds.gigavox.com/~r/gigavox/network/~5/110500129/ITC.INNO-HughMcGuire-2007.04.18.mp3 ----- End forwarded message ----- From j.hagerson at comcast.net Fri Jul 6 07:17:46 2007 From: j.hagerson at comcast.net (John Hagerson) Date: Fri, 6 Jul 2007 09:17:46 -0500 Subject: [gutvol-d] Unwrap lines utility? Message-ID: <000001c7bfd8$6f4c48e0$1f12fea9@sarek> I am corresponding with someone who would like to be able to unwrap the paragraphs from some of the older, plain text, material in our collection. I provided him the naive, three search-and-replace solution, but he says that his attempt to implement it on his computer with the file he has chosen causes his word processor to lock up. He is running Microsoft Windows XP. Has anyone already written a utility to do this? If so, please send me a pointer to it. Thank you very much. From desrod at gnu-designs.com Fri Jul 6 07:59:45 2007 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Fri, 06 Jul 2007 10:59:45 -0400 Subject: [gutvol-d] Unwrap lines utility? 
In-Reply-To: <000001c7bfd8$6f4c48e0$1f12fea9@sarek> References: <000001c7bfd8$6f4c48e0$1f12fea9@sarek> Message-ID: <1183733985.2664.279.camel@localhost.localdomain> On Fri, 2007-07-06 at 09:17 -0500, John Hagerson wrote: > I provided him the naive, three search-and-replace solution, but he > says that his attempt to implement it on his computer with the file he > has chosen causes his word processor to lock up. I use Text::Wrap to do the same exact thing, available in CPAN: http://search.cpan.org/~muir/Text-Tabs+Wrap-2006.1117/lib/Text/Wrap.pm I can't speak to why your friend's word processor "locks up", but then again, I don't run legacy operating systems or applications like Windows, so I won't be of much help there. Perl works on Windows, and this module is available there. It might be easier to just use that instead. -- David A. Desrosiers desrod at gnu-designs.com setuid at gmail.com Skype...: 860-967-3820 From Bowerbird at aol.com Fri Jul 6 09:33:00 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 6 Jul 2007 12:33:00 EDT Subject: [gutvol-d] Unwrap lines utility? Message-ID: john said: > unwrap the paragraphs from some of the > older, plain text, material in our collection http://www.z-m-l.com/go/unwrap.pl 1. paste text into box and click button. 2. copy unwrapped text underneath box. 3. paste into new document. the script won't wrap lines that start with one (or more) blanks, so you can use that rule to immunize any lines you don't want to wrap... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070706/923ba39a/attachment.htm From desrod at gnu-designs.com Fri Jul 6 09:50:55 2007 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Fri, 06 Jul 2007 12:50:55 -0400 Subject: [gutvol-d] Unwrap lines utility? In-Reply-To: References: Message-ID: <1183740656.2664.291.camel@localhost.localdomain> On Fri, 2007-07-06 at 12:33 -0400, Bowerbird at aol.com wrote: > the script won't wrap lines that start with one > (or more) blanks, so you can use that rule to > immunize any lines you don't want to wrap... Nor does your script work with multiple lines NOT separated by a blank line. Tsk. Tsk. -- David A. Desrosiers desrod at gnu-designs.com setuid at gmail.com Skype...: 860-967-3820 From ricardofdiogo at gmail.com Fri Jul 6 11:20:23 2007 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Fri, 6 Jul 2007 19:20:23 +0100 Subject: [gutvol-d] Unwrap lines utility? In-Reply-To: References: Message-ID: <9c6138c50707061120u5fd11a45j91697ab173934bbc@mail.gmail.com> 2007/7/6, Bowerbird at aol.com : > http://www.z-m-l.com/go/unwrap.pl ISO-8859-1 doesn't seem to work with it too. With some improvements, it may be a good tool. From ricardofdiogo at gmail.com Fri Jul 6 11:24:44 2007 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Fri, 6 Jul 2007 19:24:44 +0100 Subject: [gutvol-d] Unwrap lines utility? In-Reply-To: <000001c7bfd8$6f4c48e0$1f12fea9@sarek> References: <000001c7bfd8$6f4c48e0$1f12fea9@sarek> Message-ID: <9c6138c50707061124o3a4c5eb6x11debe3c77fa785f@mail.gmail.com> 2007/7/6, John Hagerson : > I am corresponding with someone who would like to be able to unwrap the > paragraphs from some of the older, plain text, material in our collection. 
I > provided him the naive, three search-and-replace solution, but he says that > his attempt to implement it on his computer with the file he has chosen > causes his word processor to lock up. > Same used to happen with my old computer. Guess it must be a memory/processing thing. I then started to it by processing several separated text chunks. From Bowerbird at aol.com Fri Jul 6 12:37:13 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 6 Jul 2007 15:37:13 EDT Subject: [gutvol-d] Unwrap lines utility? Message-ID: david said: > Nor does your script work with > multiple lines NOT separated by a blank line. > Tsk. Tsk. um, "multiple lines not separated by a blank line"? i don't understand what you're saying here, david. p.g. e-texts have a blank line between paragraphs. perhaps you could point me to a nonworking file? *** michael said: > Wouldn't be be even easier just to i/o from file to file? people who want that should backchannel me. if i get enough requests, i'll set it up that way... *** ricardo said: > ISO-8859-1 doesn't seem to work with it too. perhaps you could point me to a nonworking file? -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070706/d71404d7/attachment.htm From desrod at gnu-designs.com Fri Jul 6 13:35:32 2007 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Fri, 06 Jul 2007 16:35:32 -0400 Subject: [gutvol-d] Unwrap lines utility? In-Reply-To: References: Message-ID: <1183754132.27406.1.camel@localhost.localdomain> On Fri, 2007-07-06 at 15:37 -0400, Bowerbird at aol.com wrote: > um, "multiple lines not separated by a blank line"? > i don't understand what you're saying here, david. > p.g. e-texts have a blank line between paragraphs. Stick this into your form, you'll see what happens. Note, this should unwrap to two lines, as it is presented here. Your code joins the two lines into one long line, breaking it. --- What is Lorem Ipsum? Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. --- -- David A. Desrosiers desrod at gnu-designs.com setuid at gmail.com Skype...: 860-967-3820 From joshua at hutchinson.net Fri Jul 6 14:12:49 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Fri, 6 Jul 2007 21:12:49 +0000 (UTC) Subject: [gutvol-d] Unwrap lines utility? Message-ID: <12549565.1183756369760.JavaMail.?@fh1037.dia.cp.net> As much as it pains me to defend the bird ... Your example, by PG text formatting rules, SHOULD rewrap into a single paragraph. In PG texts, paragraphs are denoted by a blank line between them (two newline characters). The original question was about rewrapping some PG texts, so bowerbird's methodology is good. Please, don't make me agree with the bird again. It makes my head hurt. ;) Josh >----Original Message---- >From: desrod at gnu-designs.com > >On Fri, 2007-07-06 at 15:37 -0400, Bowerbird at aol.com wrote: >> um, "multiple lines not separated by a blank line"? >> i don't understand what you're saying here, david. >> p.g. e-texts have a blank line between paragraphs. > >Stick this into your form, you'll see what happens. > >Note, this should unwrap to two lines, as it is presented here. 
Your >code joins the two lines into one long line, breaking it. > >--- >What is Lorem Ipsum? >Lorem Ipsum is simply dummy text of the printing and typesetting >industry. Lorem Ipsum has been the industry's standard dummy text ever >since the 1500s, when an unknown printer took a galley of type and >scrambled it to make a type specimen book. >--- > From Bowerbird at aol.com Fri Jul 6 15:21:19 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 6 Jul 2007 18:21:19 EDT Subject: [gutvol-d] Unwrap lines utility? Message-ID: david said: > --- > What is Lorem Ipsum? > Lorem Ipsum is simply dummy text of the printing and typesetting > industry. Lorem Ipsum has been the industry's standard dummy text ever > since the 1500s, when an unknown printer took a galley of type and > scrambled it to make a type specimen book. > --- as i said earlier, project gutenberg e-texts have a blank line between paragraphs. so this text here is non-representative. the task of _restoring_ paragraphing to a text when the blank lines have been eliminated is an interesting one -- and a useful one too, because the blank lines are eliminated in text copied out of a .pdf -- but that's not what john was asking for... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070706/ad2428be/attachment.htm From Bowerbird at aol.com Sat Jul 7 12:12:27 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 7 Jul 2007 15:12:27 EDT Subject: [gutvol-d] enjoying the cinema Message-ID: if anyone out there is wondering if a cinema-screen is worth what it costs -- and they're pretty cheap now -- i can assure you that the answer is most definitely "yes!" in particular, i am finding that the ability to comfortably fit _3_ "panels" on a single screen is _very_ productive... here's a sample page from one of my text-cleaning tools: > http://www.z-m-l.com/go/triple.jpg as you can see, the original scan is displayed on the left, the text-field for making corrections is on the right, and the formatted version of the text is shown in the middle, for easy comparison to the original scan for correctness. also on the right, at the bottom, there are buttons that allow you to take some actions associated with the page. and all this functionality has more than ample real estate, with text that is sized large enough even for older eyes... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070707/be7e5288/attachment.htm From robert_marquardt at gmx.de Sun Jul 8 10:56:26 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Sun, 08 Jul 2007 19:56:26 +0200 Subject: [gutvol-d] an Esperanto book to pick up Message-ID: http://librivox.org/forum/viewtopic.php?p=142572#142572 A HTML version is available which is from a PD book and the additions (preface, footnotes etc) have been declared PD by the author. -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From shabam.dp at gmail.com Sun Jul 8 16:05:54 2007 From: shabam.dp at gmail.com (shabam) Date: Sun, 8 Jul 2007 16:05:54 -0700 Subject: [gutvol-d] Unwrap lines utility? 
In-Reply-To: <12549565.1183756369760.JavaMail.?@fh1037.dia.cp.net> References: <12549565.1183756369760.JavaMail.?@fh1037.dia.cp.net> Message-ID: <1ac896090707081605u3349433al22442e60e2bed09e@mail.gmail.com> John, What editor is he using? That could be part of the problem. Some text editors (such as MS Word) are more likely to crash on large search and replaces. Or they might appear to lock up, when they are working, and taking a really long time to run. I use a program called Edit Plus for this. It is shareware, and it does these types of replaces fairly quickly. Another choice is to break the file into smaller chunks that the editor can handle. Jason -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070708/e0009019/attachment.htm From j.hagerson at comcast.net Sun Jul 8 17:20:39 2007 From: j.hagerson at comcast.net (John Hagerson) Date: Sun, 8 Jul 2007 19:20:39 -0500 Subject: [gutvol-d] Unwrap lines utility? In-Reply-To: <1ac896090707081605u3349433al22442e60e2bed09e@mail.gmail.com> Message-ID: <004e01c7c1be$fb0c41e0$1f12fea9@sarek> Thank you for the replies on and off the list. The gentleman with whom I am corresponding is using Microsoft Word on a computer running Windows XP. He has only 400MB of memory installed on the machine, so a larger file does cause issues. I have pointed him to GutenMark and some of the other utilities. However, he would prefer a graphical interface to a command line. I don't know exactly what he wants to do with some of the books. I know that he wants to "fill the page" with text (meaning full justification, which he can do in Word). Thanks again for your assistance. John -----Original Message----- From: gutvol-d-bounces at lists.pglaf.org [mailto:gutvol-d-bounces at lists.pglaf.org] On Behalf Of shabam Sent: Sunday, July 08, 2007 6:06 PM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Unwrap lines utility? John, What editor is he using? That could be part of the problem. Some text editors (such as MS Word) are more likely to crash on large search and replaces. Or they might appear to lock up, when they are working, and taking a really long time to run. I use a program called Edit Plus for this. It is shareware, and it does these types of replaces fairly quickly. Another choice is to break the file into smaller chunks that the editor can handle. Jason From piggy at netronome.com Mon Jul 9 08:24:53 2007 From: piggy at netronome.com (La Monte Henry Piggy Yarroll) Date: Mon, 09 Jul 2007 11:24:53 -0400 Subject: [gutvol-d] Next project In-Reply-To: References: <000a01c7bb24$c062b6a0$f2226546@Lydia> Message-ID: <46925345.10107@netronome.com> I'd add that I've used advertisements to prove publication dates for various works without explicit publication dates. Ads can be downright valuable. Michael Hart wrote: > > Personally, I _LIKE_ to see that ads from a hundred years ago, > I think it gives a greater perspective on the life of the time > with the first ads for rentable rooms in NYC with kitchenettes > and the various ship names, travel arrangements, etc. . . . > > ... > Michael S. Hart > Founder > Project Gutenberg > > > On Sat, 30 Jun 2007, Robert Marquardt wrote: > >> On Sat, 30 Jun 2007 10:41:26 -0400, you wrote: >> >>> ... >>> I think I will also include the ads as they are unique period >>> ones including music, seeds, manure and a small home printing press. >>> >>> Dick >> >> >> The Periodicals Bookshelf should give some examples. 
>> http://www.gutenberg.org/wiki/Category:Periodicals_Bookshelf >> I think the "Bulletin de Lille" may be what you look for. >> >> BTW ads are still sh-- eh manure. From da.ajoy at gmail.com Mon Jul 9 13:11:41 2007 From: da.ajoy at gmail.com (Daniel Ajoy) Date: Mon, 09 Jul 2007 15:11:41 -0500 Subject: [gutvol-d] Unwrap lines utility? Message-ID: <4692502D.20873.13845EEC@da.ajoy.gmail.com> I use Clippy http://wots.coolfreepage.com/link.php?id=SW3 Daniel From Bowerbird at aol.com Mon Jul 9 17:22:00 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 9 Jul 2007 20:22:00 EDT Subject: [gutvol-d] knock me down! july 9th, 2007! Message-ID: good grief! i just got knocked down! hard! i went to google books looking for a book -- "the story of patsy", if you must know -- and saw an option there to "view plain text". sure enough, google is giving us the o.c.r.! this is a _game-changer_, folks. hallelujah! -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070709/7c42df8a/attachment.htm From Bowerbird at aol.com Tue Jul 10 13:51:52 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 10 Jul 2007 16:51:52 EDT Subject: [gutvol-d] game-changer Message-ID: yes sir, a game-changer, and not just in one way, in lots of ways. which means tomorrow, 7/11/2007, is your lucky day, because you get to read one of those _oh-so-rare_ posts from bowerbird where he says "i was wrong", and this one will be even-more-rare, since he says he will say it "more than once". we took this to mean "twice", and asked him to confirm it, and he corrected us to _repeat_ "more than once", leading us to think he might even say it _3_ times! which, reports claim, would stun even the most hardened observers... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070710/236e6dde/attachment.htm From Bowerbird at aol.com Tue Jul 10 16:15:27 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 10 Jul 2007 19:15:27 EDT Subject: [gutvol-d] Unwrap lines utility? Message-ID: i wrote this on sunday, but didn't send it out. its relevance has been increased by the recent developments, however (since it is _google_ who's losing the paragraphing on o.c.r. that it gives to umichigan, and now to the general public), and thus, here it the message... in some future posts, i'll deal some more with the general question of paragraphing that's missing throughout the book, but the issues are roughly the same as i outlined here when they occur on a pagebreak... now, back to the message as written... *** warning: this message is of interest only to extreme text-cleaning geeks. other people should opt out now... *** since hacker-david brought it up, here's a slight reworking of his example. >?? --- >?? What is Lorem Ipsum? >?? Lorem Ipsum is the dummy text of the printing and typesetting industry. > Lorem Ipsum has been the industry's standard dummy text ever since the > 1500s, when an unknown printer took a galley of type and scrambled it > to make a type specimen book. >?? --- remember that david said his example would be broken into 2 paragraphs, the question and the answer. 
however, in this reworking, it's ambiguous as to whether it should be _2_ or _3_ paragraphs, since the first line of the answer now contains one complete sentence, and nothing more. so it could be broken as the original example, or instead as three paragraphs, as shown here: what is lorem ipsum? lorem ipsum is the dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. *** believe it or not, this _does_ have some immediate relevant applicability. as i noted, text copied out of a .pdf often is stripped of its empty lines. sadly, the paragraph indents are _also_ lost, which makes it even more difficult to restore the paragraphing. and perhaps even worse, umichigan has lost the empty lines in the text that it is releasing to the general public from the google scanning project. so this release -- which would be an _excellent_ thing, were the text not marred with this (and many other) problems, seeing that google itself is _not_ releasing the actual text, just the images -- is actually kind of sad... > http://mdp.lib.umich.edu/cgi/m/mdp/pt?view=text;id=39015016881628;u=1;num=129 on the home front, one of the tasks in digitizing a book is to make sure that all the paragraphing is correct, and one of the places where this is the most difficult one paragraph ends at the bottom of one page, with a new one beginning at the top of the next page. much of the time, it's clear when a sentence ends on the bottom line of a page that it's the last line of a paragraph because the line doesn't make it anywhere close to the right margin. but sometimes it's not. for instance, take a look at this page and tell me if the paragraph ended: > http://www.z-m-l.com/go/mabie/mabiep122.html go to the next page to see if you were correct. and try these: > http://www.z-m-l.com/go/mabie/mabiep040.html > http://www.z-m-l.com/go/mabie/mabiep067.html > http://www.z-m-l.com/go/mabie/mabiep208.html > http://www.z-m-l.com/go/mabie/mabiep214.html again, you must go to the next page to see if you were correct, to see if the line at the top of the next page was _indented_, indicating that it is the start of a new paragraph. (if you like this game, i've appended a bunch more tests, from another book, one that has more paragraphs in it.) distributed proofreaders has an ongoing "discussion" about whether or not to put a blank line at the top of a page where the first line is indented as a new paragraph, for reasons that i cannot comprehend. _of_course_ there must be a blank line, to indicate clearly the first line is the start of a new paragraph, a decision that cannot be reliably made on the previous page. fortunately, the number of pages that have to be checked for this in a typical book is relatively small, and can be located fairly easily by a computer routine that summons them for human eyeballs... 
-bowerbird > http://www.z-m-l.com/go/myant/myantf013.html > http://www.z-m-l.com/go/myant/myantp014.html > http://www.z-m-l.com/go/myant/myantp040.html > http://www.z-m-l.com/go/myant/myantp074.html > http://www.z-m-l.com/go/myant/myantp077.html > http://www.z-m-l.com/go/myant/myantp084.html > http://www.z-m-l.com/go/myant/myantp093.html > http://www.z-m-l.com/go/myant/myantp112.html > http://www.z-m-l.com/go/myant/myantp123.html > http://www.z-m-l.com/go/myant/myantp126.html > http://www.z-m-l.com/go/myant/myantp135.html > http://www.z-m-l.com/go/myant/myantp137.html > http://www.z-m-l.com/go/myant/myantp172.html > http://www.z-m-l.com/go/myant/myantp206.html > http://www.z-m-l.com/go/myant/myantp209.html > http://www.z-m-l.com/go/myant/myantp236.html > http://www.z-m-l.com/go/myant/myantp261.html > http://www.z-m-l.com/go/myant/myantp268.html > http://www.z-m-l.com/go/myant/myantp274.html > http://www.z-m-l.com/go/myant/myantp304.html > http://www.z-m-l.com/go/myant/myantp317.html > http://www.z-m-l.com/go/myant/myantp321.html > http://www.z-m-l.com/go/myant/myantp407.html ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070710/2b6fa066/attachment.htm From Bowerbird at aol.com Wed Jul 11 12:40:35 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 11 Jul 2007 15:40:35 EDT Subject: [gutvol-d] they didn't care Message-ID: > http://radar.oreilly.com/archives/2007/07/clay_shirky_a_s.html on the o'riley blogs, jimmy guterman points to: > a dizzying presentation by Clay Shirky in which he likens > the guardians of a Shinto shrine to the perl community. > It also includes one of the best sentences I've heard all year, > one that will ring true for anyone who's tried to convince > entrenched thinkers of the value of innovation: > ?They didn't care that they'd seen it work in practice > because they already knew it wouldn't work in theory.? ring a bell? watch the movie: > http://www.supernova2007.com/downloads/shirky.mov it's rare you hear a techie using _love_ as operative concept, but shirky does, and does it well. -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070711/b66e1bd7/attachment.htm From Bowerbird at aol.com Wed Jul 11 13:02:23 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 11 Jul 2007 16:02:23 EDT Subject: [gutvol-d] i was wrong Message-ID: i love to admit it when "i was wrong". i have argued that google book scanning would _not_ release their o.c.r. results, even for the public-domain books they are scanning, because it is clearly against their best interests from a _business_ point-of-view... google is spending hundreds of millions of dollars on the project, so why just hand over those expensive results to competitors? it makes no business sense... however, as i reported on monday, google is indeed now giving us their o.c.r. results. so i was _wrong_. wrong wrong wrong. dead wrong. sorry about that. i think this also shows that google is _not_ just acting within the tight constraints of a "business" standpoint. which -- considering how some people like to paint 'em as advertising moneygrubbers -- is fairly enlightening. 
i hope when people sum up the totals for "do no evil", google will be rewarded for releasing this text to us... now, i'm not some google fanboy here. anyone who has looked at the o.c.r. they're releasing will recognize that it is shit. it is badly in need of correction. and i'm _certain_ that google has the know-how in-house to clean it up... so if they really want to impress us, release _that_ text... in the meantime, though, i'm not gonna complain about the low quality of this o.c.r. text, i'm just gonna clean it... the other thing that must be mentioned here, in fairness, is that there is _some_ reason to believe that google has released this text only because they feel that they _must_, to avoid any criticism (and perhaps even legal action) from visually-impaired people who can't use screen-readers on the page-scans that google was _originally_ offering to us. if that's the case, then maybe the act isn't quite so generous after all. then again, perhaps they've done it in the name of "accessibility" -- even though they don't feel that they must -- because they _think_it's_the_right_thing_to_do_, in which case i believe they should be applauded. and then we should take the text they've made available, and run with it. run far with it. anyway, one more time, for all those people out there who can't hear me say it enough: i was wrong. *** and that's not the only thing i was wrong about. i said that nicholas hodson had found that finereader v8 was significantly better than v7, but in attempting to confirm that, it appears that my memory wasn't entirely accurate, and that v8 might not give recognition that is all _that_ much better... i'm following up on it, and will give you the solid facts later, but in view of the fact that everyone agrees that v8 is slower, the case that it is _clearly_ superior to v7 is somewhat shaky. so let me just say "i was wrong" about that too, and get it over. -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070711/157baf53/attachment.htm From schultzk at uni-trier.de Wed Jul 11 23:56:19 2007 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Thu, 12 Jul 2007 08:56:19 +0200 Subject: [gutvol-d] i was wrong In-Reply-To: References: Message-ID: Hi BB, Am 11.07.2007 um 22:02 schrieb Bowerbird at aol.com: > i love to admit it when "i was wrong". > > i have argued that google book scanning would _not_ > release their o.c.r. results, even for the public-domain > books they are scanning, because it is clearly against > their best interests from a _business_ point-of-view... > > google is spending hundreds of millions of dollars > on the project, so why just hand over those expensive > results to competitors? it makes no business sense... Ahh! You have erred Again. I can not give you an it's true motives. Yet I would say they are building a market. How does Google makes it's money! If Google can sayso and so many are visting us that means $$$$$ for them. They probably realized that most prefer text over images. So as any good company it has change it's business model. Actually, nothing surprising. [Snip, snip, snip] > anyway, one more time, for all those people out there who > can't hear me say it enough: i was wrong. > So you are human after all !! ;-))) (I could help that one) regards Keith. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070712/ce93e0e7/attachment.htm From Bowerbird at aol.com Thu Jul 12 11:09:05 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 12 Jul 2007 14:09:05 EDT Subject: [gutvol-d] where i'm going with this Message-ID: scans and o.c.r. output of thousands and thousands of public-domain books are now readily available to any american with internet access... thank you google! while the o.c.r. often needs heavy corrections, once that editing is done, the text of the entire book can be put into one file and transmitted at will. this text file can be accompanied by image-scans of the illustrations, or even included in the same zip file that contains all the page-scans (in which case the illustrations wouldn't have to be treated separately). the robustness of text-files has been proven over and over, historically. while other file-formats come and go, blowin' in the wind, the text-file -- especially when it is converted to new physical media -- is _solid_... combined with .png or .jpg (both for page-scans and/or illustrations), we can be assured that this package will be readable far into the future. now, if only there was a way to code formatting into those text-files. oh wait, there is, thanks to light-markup systems. we're good to go! *** google has been giving us the scans, and now the text as well. so we have everything we need in order to proof an entire book. that is, even _before_ the text is corrected, the o.c.r. output of a book can be packaged with the scan-set, where both are used as _input_ to a tool that helps a person _do_ the corrections for that book, by comparing o.c.r. text for each page to the image-scan for the page... so now you see where i'm going with this... because long-time lurkers on this list will remember that i have been saying for many years now that i've written such a tool -- codenamed "banana-cream" -- but this has been returned with much skepticism by my antagonists here. they've challenged it, calling it "vaporware"... because i never released the thing, their little echo-chamber did a fine job of convincing itself that their allegations had some merit... um, sorry charlie... banana-cream has been alive and well all this time, waiting for her time to go on stage, and now that glorious time has come, yes it has. the funny thing is, i was already grooming her for appearance soon. because i was tired of keeping her under wraps. i was gonna use the books i have on z-m-l.com as her examples, and i'll probably continue with that "controlled environment" approach, but will now also whip her into shape to tame the wild google monster too. the time has finally come that i saw as inevitable some 2.5 years back: > http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005& post=2005-01-01,1 *** so let's talk a minute about how a member of the general public might go about this task of cleaning up the o.c.r. for a book... now, one of the things that happens, in the course of this cleaning, is that you find yourself jumping all over the book, checking out things. in order to perform this frequent behavior as expeditiously as possible, you need to boil it down to its essence. if you want to jump to page 69, the best tool will let you type in a "6" and then a "9" and then hit [enter] and boom! you're on page 69, with text on one side, image on the other, press 123[enter] and boom!, you should be on page 123, just like that... once again, editable text on one side, the page-scan on the other side. 
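to make that jump-to-page idea concrete, here is a minimal console sketch of the text-beside-scan navigation -- emphatically not banana-cream itself, just an illustration of the behaviour described above and below. the per-page file layout (pages/069.txt, scans/069.png) and BASE_URL are invented for the example; the page-scan is fetched from the web only when a local copy isn't already cached, along the lines of the re-use-don't-re-download approach described a little further down.

# a rough sketch (not banana-cream) of jump-to-page proofing navigation.
# the directory names, file-naming scheme, and BASE_URL are assumptions
# made up purely for illustration.

import os
import urllib.request

BASE_URL = "http://example.org/mybook/scans/"   # hypothetical scan location
TEXT_DIR = "pages"                              # one .txt file of o.c.r. text per page
SCAN_DIR = "scans"                              # one .png page-scan per page

def scan_path(page):
    """Return the local path of the page-scan, downloading it once if absent."""
    name = "%03d.png" % page
    local = os.path.join(SCAN_DIR, name)
    if not os.path.exists(local):               # cache miss: fetch and keep for re-use
        os.makedirs(SCAN_DIR, exist_ok=True)
        urllib.request.urlretrieve(BASE_URL + name, local)
    return local

def page_text(page):
    """Return the o.c.r. text for one page, or a placeholder if it is missing."""
    path = os.path.join(TEXT_DIR, "%03d.txt" % page)
    if not os.path.exists(path):
        return "(no text for this page yet)"
    with open(path, encoding="utf-8") as f:
        return f.read()

if __name__ == "__main__":
    while True:
        entry = input("page> ").strip()         # e.g. type 69 and hit enter
        if not entry.isdigit():
            break                               # anything non-numeric quits
        page = int(entry)
        print(page_text(page))                  # the editable text for one side...
        print("scan:", scan_path(page))         # ...and the scan for the other

run it in a folder holding the per-page text files, type 69 and hit [enter], and you get back the o.c.r. text for page 69 plus the local path of its scan.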
> http://snowy.arsc.alaska.edu/bowerbird/bc003.jpg and, if you have room in the middle, a formatted display of the text, so you can compare to ensure you used the proper z.m.l. for the situation. > http://z-m-l.com/go/triple.jpg (yes, "digitizing" a text means formatting it as well, not just proofing it.) so, we see the basic user-interface for the tool -- a text/scan hybrid -- where we can jump directly to any page. when displaying a page, the tool should fetch the scan from the same directory containing the text file or -- if the scan isn't there -- download it from the web to that local folder. it's important to save it for reuse, not to download the thing every time. moreover, this approach should be utilized by _any_ viewer-program; it's important that we have each book stored online, so that anyone can access it from there at any time. but there's no reason to have a person continually download the same material over and over while re-reading. so a plan to methodically mirror the content locally is the best approach... again, that's what banana-cream does... it will also download the scans in a batch via an unattended process, which is how you'll likely do it, but if you have broadband, you can have it download _while_ you work and you'll find it hard to stay up with its download (e.g., 12 pages/minute)... of course, you want the program to _flag_possible_errors_ that it finds on a page, so you can check 'em. if they do need correcting, you want the tool to _faciliate_ that. again, that, too, be what banana-cream do... all this will become much more clear when i actually release the app, so i'm gonna go off and work on that for a little while. see you later. (i'd say you can expect to see the initial release of the basic engine as early as next week, with various checks being bundled in on a regular basis after that, providing there is any interest in the app. of course, if no one cares, then i won't bother to work on it much.) *** (don't worry, the series about the inefficiency of the d.p. workflow _will_ continue, it's just that with all of these recent developments, i've got more important things to do right now...) *** oh yeah, having the o.c.r. from google readily available also means -- since we have the o.c.r. from the open content alliance as well -- that we can also now focus some serious attention on the strategy of locating glitches in o.c.r. via _comparison_ of different o.c.r. passes. my research into this has found that this approach can give results that are simply _amazing_. i can't remember how much of that i've shared with this list, but i documented my research _thoroughly_ over on the d.p. forums. (and what a big waste of time _that_ was!) if you want to review it (and see how d.p. people ignored it), visit the thread i started over there titled "revolutionary o.c.r. proofing": > http://www.pgdp.net/phpBB2/viewtopic.php?t=24008 -bowerbird ************************************** Get a sneak peak of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070712/1af92a9c/attachment-0001.htm From Bowerbird at aol.com Thu Jul 12 15:57:08 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 12 Jul 2007 18:57:08 EDT Subject: [gutvol-d] putting your novel online for free Message-ID: think only unknowns put their novel online for free? think again. 
these days, even nobel-prize-winners-for-literature -- like elfriede jelinek (class of 2004) -- are doing it. let's make sure we _reward_ brave pioneers like this, with our appreciation and a little bit of cold hard cash, so our cyberlibrary of the future is an appealing place... > http://www.elfriedejelinek.com/ -bowerbird ************************************** Get a sneak peak of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070712/3e6c3285/attachment.htm From Bowerbird at aol.com Fri Jul 13 10:37:13 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 13 Jul 2007 13:37:13 EDT Subject: [gutvol-d] digitizing rare and inaccessible books for print-on-demand Message-ID: kirtas, the company that makes the neat scanning machines > http://www.kirtastech.com/ made a very interesting announcement last month: > http://www.kirtas-tech.com/newsletterNew.asp?ID=26 they are joining together with some libraries (cincinnati public and toronto public) and universities (emory, university of maine) to scan thousands of rare and inaccessible books and then distribute them via amazon.com's print-on-demand service. no word on whether the _electronic_ versions will be free... on the one hand, since this is being pitched (at least by kirtas) as a way that these institutions can generate a cash-flow that supports the cost of the scanning, you might presume "no"... on the other hand, one of the things that libraries _do_ is to provide access to books for free, so you might presume "yes". either way, it's good news for book digitizers, because we can always buy one copy of the printed book, digitize it ourselves, and then release the resultant electronic copy out to the wild... -bowerbird ************************************** Get a sneak peak of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070713/0a0bce41/attachment.htm From Bowerbird at aol.com Sun Jul 15 21:49:00 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 16 Jul 2007 00:49:00 EDT Subject: [gutvol-d] crop circles Message-ID: another excellent reason to crop your scans consistently and fairly tightly is that they will work much better on the iphone if you do... :+) -bowerbird ************************************** Get a sneak peak of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070716/57c5bdde/attachment.htm From desrod at gnu-designs.com Mon Jul 16 12:43:45 2007 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Mon, 16 Jul 2007 15:43:45 -0400 Subject: [gutvol-d] Speaking of OCR and Captcha... Message-ID: <1184615025.9208.2.camel@localhost.localdomain> I didn't see mention of this on the list, and a quick Google search intersecting Project Gutenberg with this other project didn't produce many relevant results, but I think this could have real promise for digitizing the PG archives of OCR'd scans: --- http://www.recaptcha.net/ reCAPTCHA improves the process of digitizing books by sending words that cannot be read by computers to the Web in the form of CAPTCHAs for humans to decipher. 
More specifically, each word that cannot be read correctly by OCR is placed on an image and used as a CAPTCHA. This is possible because most OCR programs alert you when a word cannot be read correctly. But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct. Currently, we are helping to digitize books from the Internet Archive --- -- David A. Desrosiers desrod at gnu-designs.com setuid at gmail.com http://projects.plkr.org/ Skype...: 860-967-3820 From f.fuchs at gmx.net Mon Jul 16 13:52:49 2007 From: f.fuchs at gmx.net (Franz Fuchs) Date: Mon, 16 Jul 2007 22:52:49 +0200 Subject: [gutvol-d] Aaron Swartz: Announcing the Open Library Message-ID: http://www.aaronsw.com/weblog/openlibrary links to http://demo.openlibrary.org/about From Bowerbird at aol.com Mon Jul 16 14:17:52 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 16 Jul 2007 17:17:52 EDT Subject: [gutvol-d] Speaking of OCR and Captcha... Message-ID: luis von ahn is a brilliant fellow. a macarthur fellow, in fact, and that's a good thing, because i think only a genius could make this effective... let me know if it ends up being worthwhile... -bowerbird ************************************** Get a sneak peak of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070716/fd00b87f/attachment.htm From Bowerbird at aol.com Mon Jul 16 14:20:34 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 16 Jul 2007 17:20:34 EDT Subject: [gutvol-d] Aaron Swartz: Announcing the Open Library Message-ID: somebody's gotta clean up that rat's nest over at internet archive. and aaron might be just the guy. i'd call him a boy genius, except he is no longer a boy, and he hasn't won the macarthur prize. yet. let me know if this ends up being worthwhile... -bowerbird ************************************** Get a sneak peak of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070716/64212d87/attachment.htm From bzg at altern.org Mon Jul 16 16:47:31 2007 From: bzg at altern.org (Bastien) Date: Tue, 17 Jul 2007 01:47:31 +0200 Subject: [gutvol-d] Aaron Swartz: Announcing the Open Library In-Reply-To: (Bowerbird@aol.com's message of "Mon\, 16 Jul 2007 17\:20\:34 EDT") References: Message-ID: <87odibc3n0.fsf@bzg.ath.cx> Bowerbird at aol.com writes: > and aaron might be just the guy. i'd call him a boy genius, except > he is no longer a boy, and he hasn't won the macarthur prize. yet. Hey! This list was more fun when you were the only "genius" around this place... i'm certainly getting a bit old. 
-- Bastien From Bowerbird at aol.com Mon Jul 16 19:59:45 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 16 Jul 2007 22:59:45 EDT Subject: [gutvol-d] Aaron Swartz: Announcing the Open Library Message-ID: bastien said: > i'm certainly getting a bit old. i'm an old fart too... :+) but aaron, he's a whippersnapper. he was an internet celebrity at 16. (for coding, not for dating paris...) now, at 21, he's a college dropout. (stanford, if i remember correctly...) oh well, didn't seem to hurt bill gates _or_ steve jobs, i'm sure he'll recover. ;+) -bowerbird p.s. aaron is also the co-inventor of "markdown", the famous light markup. and i see he worked it into the new site. ************************************** Get a sneak peek of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070716/7826a307/attachment.htm From robert_marquardt at gmx.de Mon Jul 16 23:46:57 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Tue, 17 Jul 2007 08:46:57 +0200 Subject: [gutvol-d] 10.000 downloads of the SF CD Message-ID: <7cpo93hh63hun29mbdgbbmo7fpafbc6aeh@4ax.com> Should happen today. A good 1.4 terabyte of downloads. -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From bzg at altern.org Tue Jul 17 09:00:44 2007 From: bzg at altern.org (Bastien) Date: Tue, 17 Jul 2007 18:00:44 +0200 Subject: [gutvol-d] Aaron Swartz: Announcing the Open Library In-Reply-To: (Franz Fuchs's message of "Mon\, 16 Jul 2007 22\:52\:49 +0200") References: Message-ID: <877iozqatv.fsf@bzg.ath.cx> "Franz Fuchs" writes: > http://www.aaronsw.com/weblog/openlibrary > http://demo.openlibrary.org/about I think it might be interesting to connect the Open Library and the Freebase Project: http://www.freebase.com -- Bastien From hart at pglaf.org Thu Jul 26 09:01:12 2007 From: hart at pglaf.org (Michael Hart) Date: Thu, 26 Jul 2007 09:01:12 -0700 (PDT) Subject: [gutvol-d] !@!Re: OCR question (fwd) Message-ID: ---------- Forwarded message ---------- Date: Thu, 26 Jul 2007 11:55:46 -0400 From: Zack To: Michael S. Hart Subject: Re: OCR question Sounds good, thanks. Michael Hart wrote: > > With you permission, I will forward you question around. > > Michael > > > On Sun, 22 Jul 2007, Zack wrote: > >> Hello, >> >> I have a photocopy of an 1836 book that I made >> in a library, and I'd like to OCR it and submit it to >> your project. However most of the page images >> of pages that were not flat when I photographed them, >> and I was wondering if you might know of any program >> that can transcribe text that is curved. My hope is >> that some computer science grad student somewhere >> has worked on this problem and found a way to >> find the lines of text even when they are curved, and >> has made his/her software free. >> >> Thanks, >> Zack Smith >> > From greg at durendal.org Thu Jul 26 09:51:45 2007 From: greg at durendal.org (Greg Weeks) Date: Thu, 26 Jul 2007 12:51:45 -0400 (EDT) Subject: [gutvol-d] !@!Re: OCR question (fwd) In-Reply-To: References: Message-ID: >> On Sun, 22 Jul 2007, Zack wrote: >> >>> Hello, >>> >>> I have a photocopy of an 1836 book that I made >>> in a library, and I'd like to OCR it and submit it to >>> your project. However most of the page images >>> of pages that were not flat when I photographed them, >>> and I was wondering if you might know of any program >>> that can transcribe text that is curved. 
My hope is >>> that some computer science grad student somewhere >>> has worked on this problem and found a way to >>> find the lines of text even when they are curved, and >>> has made his/her software free. A program called unpaper can undo much of the distortion. I use the gimp and do it by hand usually. -- Greg Weeks http://durendal.org:8080/greg/ From piggy at netronome.com Thu Jul 26 11:18:51 2007 From: piggy at netronome.com (La Monte Henry Piggy Yarroll) Date: Thu, 26 Jul 2007 14:18:51 -0400 Subject: [gutvol-d] !@!Re: OCR question (fwd) In-Reply-To: References: Message-ID: <46A8E58B.7010408@netronome.com> Greg Weeks wrote: >>> On Sun, 22 Jul 2007, Zack wrote: >>> >>> >>>> Hello, >>>> >>>> I have a photocopy of an 1836 book that I made >>>> in a library, and I'd like to OCR it and submit it to >>>> your project. However most of the page images >>>> of pages that were not flat when I photographed them, >>>> and I was wondering if you might know of any program >>>> that can transcribe text that is curved. My hope is >>>> that some computer science grad student somewhere >>>> has worked on this problem and found a way to >>>> find the lines of text even when they are curved, and >>>> has made his/her software free. >>>> > > A program called unpaper can undo much of the distortion. I use the gimp > and do it by hand usually. > > There is a web service which does a pretty good job here: http://quito.informatik.uni-kl.de/dewarp/dewarp.php I wish they would just release the source code instead; then we could fix the fact that it only works with jpeg images.
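a small practical footnote to that jpeg complaint: whatever format the photographed pages are saved in, they can be batch-converted to jpeg before being handed to the dewarping service. a minimal sketch, assuming the Pillow imaging library (an outside assumption -- it is not mentioned in the thread) and made-up folder names:

# batch-convert photographed pages to jpeg so the jpeg-only dewarping
# service can accept them. Pillow ("pip install Pillow") and the folder
# names are assumptions for the sake of the example.

import os
from PIL import Image

SRC_DIR = "photos"   # the library photographs, in whatever format they were saved
OUT_DIR = "jpegs"    # jpeg copies to hand to the dewarping service

os.makedirs(OUT_DIR, exist_ok=True)
for name in sorted(os.listdir(SRC_DIR)):
    base, ext = os.path.splitext(name)
    if ext.lower() not in (".png", ".tif", ".tiff", ".bmp"):
        continue                                 # skip files that are already jpeg, or not images
    img = Image.open(os.path.join(SRC_DIR, name))
    img = img.convert("L")                       # grayscale is plenty for o.c.r. work
    img.save(os.path.join(OUT_DIR, base + ".jpg"), "JPEG", quality=90)
    print("wrote", base + ".jpg")

the same loop is also a handy place to grayscale or downsample the photos before any o.c.r. pass.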