From Bowerbird at aol.com Thu Nov 1 08:56:55 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 1 Nov 2007 11:56:55 EDT Subject: [gutvol-d] a post about the .html versions Message-ID: here's a message that just came across on the rocketbook listserve... > Some Gutenberg books are now available in HTML format, which is > a great improvement on simple text when used on the REB1200. > However the newer ones use CSS extensively to create books that > look great when read in a browser, but like crap when converted to > IMP by the ebook librarian (I'm using the breeno one). e.g. text is > indented and truncated, pictures are distorted, etc. I recall that > there were a set of HTML format rules somewhere that specified > exactly what was legal. Does anyone have a link to it? > Are all CSS styles ignored cleanly? Any ideas or comments? for your information... -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071101/85cb9bcb/attachment.htm From Bowerbird at aol.com Thu Nov 1 13:10:35 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 1 Nov 2007 16:10:35 EDT Subject: [gutvol-d] how to be typographically beautiful Message-ID: i'd certainly like to talk about e-book topics... even if nobody else here seems to want to... :+) so anyway, here's one... there's widespread agreement out in the world that project gutenberg's e-texts are extremely useful... but the flip-side is that plain-ascii is _quite_ugly_. what i've noticed, though, is that it doesn't take much to bring them up to the standards of book typography. furthermore, i've noticed that there's general agreement what needs to be done, as evidenced by common practice. the best examples are what is done by d.p. post-producers when they create an .html version. but i have also examined a .pdf library produced by the people over at planetpdf.com, the books from blackmask, manybooks, feedbooks, etc., and whatever other various conversions i could get my hands on, including those geared to handhelds (rocketbook, sony, etc.). even the various layouts of blogs and web-pages are useful, since they are keyed to make electronic text more readable... the idea is that you've loaded a plain-ascii p.g. e-text into your word-processor or desktop-publishing program with the objective of making it beautiful. what exactly do you do? please add to this, the start of a list, off the top of my head: 1. get rid of that ugly legalese at the top of the file. 2. make the title-page and front-matter look nice. 3. hotlink the table of contents. make one if necessary. 4. make all the headers big, bold, and distinctive, and 5. start chapters on a new page, maybe even a recto. 6. get rid of the empty lines between paragraphs, and 7. use book-style indents on each paragraph instead. 8. use full justification. or at least half-ragged. 9. use a reasonable line-width. full-screen is too wide. 10. white-space is free in an e-book, so use it liberally. 11. make block-quotes distinctive, for remix purposes. 12. links are great, but spare us the ugly blue underlines. 13. is an unlucky number. 14. don't put pagenumbers inside the text/paragraphs. 15. turn pg-ascii underscored text into _real_ italics. 16. pictures (even doodad thingees) enliven the text. 17. navigation aids among chapters are quite useful. 18. footnotes should have links going _both_ ways. 19. 
if it works better that way, turn a table on its side. 20. resize tables and images so they fit on one screen. 21. give your readers the luxury of generous leading! 22. (leaving some space for you...) 23. (leaving some space for you...) 24. (leaving some space for you...) 25. (leaving some space for you...) 26. show where we are in the book (page 39 of 208). 27. make the framework of the document _obvious_. 28. what the heck, just for the fun of it, make an index! 29. make the typesize big enough to be read easily! 30. get rid of that ugly legalese at the bottom of the file. these are general strategies. not all of them will be applicable to any one specific situation, and some (e.g., #8) are up to the preferences of the individual. and obviously, some of these could be fragmented into a very large number of sub-points, like #10... but these are the tricks that i've seen being used to bring some typographical beauty to p.g. e-texts... -bowerbird p.s. feedbooks.com creates very beautiful e-books... ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071101/598843be/attachment.htm From lee at novomail.net Thu Nov 1 15:02:12 2007 From: lee at novomail.net (Lee Passey) Date: Thu, 01 Nov 2007 15:02:12 -0700 Subject: [gutvol-d] More useless data Message-ID: <472A4CE4.4020704@novomail.net> Quick ... what is the most commonly downloaded book from project Gutenberg in the last three years? I promised Jon Noring some data a few months back, and I thought I'd deliver it in this forum, because some other people might find it interesting. As most people here know, TPTB at project Gutenberg deny having any download statistics beyond the past 30 days. Fortunately, for years now the Internet Archive has been trolling the internet, making periodic snapshots of web sites, including Project Gutenberg. So I went to the Internet Archive and captured the Project Gutenberg statistics pages since September 2004. I collated all the data, and came up with a list of 408 files which have appeared in the "30 day - Top 100" since that date. I added and resorted them, and now have a list of the most popular downloads from the PG web site since Sept. 2004. And the most popular download during the past three years is: (drumroll please) The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci (etext 5000) The rest of the top ten are: 2 The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (9551 & 1661) 3 The Art of War by Sun Tzu (132, 17405 & 20594) 4 Le Kamasutra by Vatsyayana (14609) 5 Pride and Prejudice by Jane Austen (1342 & 20686) 6 The War of the Worlds by H. G. Wells (36 & 8976) 7 Ulysses by James Joyce (4300) 8 Little Journeys to the Homes of the Great - Volume 01 of 14 by Elbert Hubbard (12933) 9 Manual of Surgery by Alexander Miles and Alexis Thomson (17921) 10 Hand Shadows to Be Thrown upon the Wall by Henry Bursill (12962) 11 The Adventures of Huckleberry Finn by Mark Twain (76 & 19640) 12 Alice's Adventures in Wonderland by Lewis Carroll (11, 19573 & 928) (I know this is 12, but I couldn't bear to leave out Alice and Huck.) (Caveat: I was unable to get precise 30 day intervals, so this list is an approximation. A /very good/ approximation, but an approximation nonetheless.) (Caveat bis: These data are derived from that reported on the PG web site. They are only as good as PG's reporting.) 
Of course, because the PG corpus is always growing, this kind of linear analysis may over-weight early downloads. So I changed the collation algorithm a bit. I started with a 6 month baseline, and then as I added each 30 day list I increased the weighting by 4%. That is, the data as of Feb. 2005 was counted at 100%, but the data from Feb-Mar was counted at 104%, the data from Mar-Apr was counted at 108%, the data from Apr-May was counted at 112%, etc. Thus, more recent downloads got counted more heavily that more distant downloads. So what is the adjusted top ten list? 1 The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci (5000) 2 The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (9551 & 1661) 3 The Art of War by Sun Tzu (132, 17405 & 20594) 4 Le Kamasutra by Vatsyayana (14609) 5 Pride and Prejudice by Jane Austen (1342 & 20686) 6 Manual of Surgery by Alexander Miles and Alexis Thomson (17921) 7 How to Speak and Write Correctly by Joseph Devlin (6409) 8 Ulysses by James Joyce (4300) 9 Little Journeys to the Homes of the Great - Volume 01 of 14 by Elbert Hubbard (12933) 10 The War of the Worlds by H. G. Wells (36 & 8976) 11 The Adventures of Huckleberry Finn by Mark Twain (76 & 19640) As you can see, the Manual of Surgery is more popular recently, and Hand Shadows less so. Alice dropped to 14, so I didn't feel like I could include her. What is interesting is that the addition of new files to the PG corpus has not had much affect on the most popular file downloads. The data for all 400+ files can be found at http://www.passkeysoft.com/~lee/zero.txt and http://www.passkeysoft.com/~lee/four.txt. Bowerbird, if you want to know where to start in your conversion process to z.m.l., I would suggest the books on this list. I hand manipulated the first 50 entries in each file, to try to count multiple editions of the same book as a single entry, the remaining data is raw. Enjoy! -- Nothing of significance below this line. From hart at pglaf.org Thu Nov 1 14:48:20 2007 From: hart at pglaf.org (Michael Hart) Date: Thu, 1 Nov 2007 14:48:20 -0700 (PDT) Subject: [gutvol-d] !@!Re: More useless data In-Reply-To: <472A4CE4.4020704@novomail.net> References: <472A4CE4.4020704@novomail.net> Message-ID: With your permission, I'd like to include your files, and perhaps this report, in a Project Gutenberg file. Thanks!!! Michael S. Hart Founder Project Gutenberg On Thu, 1 Nov 2007, Lee Passey wrote: > Quick ... what is the most commonly downloaded book from project > Gutenberg in the last three years? > > I promised Jon Noring some data a few months back, and I thought I'd > deliver it in this forum, because some other people might find it > interesting. > > As most people here know, TPTB at project Gutenberg deny having any > download statistics beyond the past 30 days. Fortunately, for years now > the Internet Archive has been trolling the internet, making periodic > snapshots of web sites, including Project Gutenberg. So I went to the > Internet Archive and captured the Project Gutenberg statistics pages > since September 2004. I collated all the data, and came up with a list > of 408 files which have appeared in the "30 day - Top 100" since that > date. I added and resorted them, and now have a list of the most popular > downloads from the PG web site since Sept. 2004. 
> > And the most popular download during the past three years is: > > (drumroll please) > > The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci > (etext 5000) > > The rest of the top ten are: > > 2 The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (9551 & > 1661) > 3 The Art of War by Sun Tzu (132, 17405 & 20594) > 4 Le Kamasutra by Vatsyayana (14609) > 5 Pride and Prejudice by Jane Austen (1342 & 20686) > 6 The War of the Worlds by H. G. Wells (36 & 8976) > 7 Ulysses by James Joyce (4300) > 8 Little Journeys to the Homes of the Great - Volume 01 of 14 by Elbert > Hubbard (12933) > 9 Manual of Surgery by Alexander Miles and Alexis Thomson (17921) > 10 Hand Shadows to Be Thrown upon the Wall by Henry Bursill (12962) > 11 The Adventures of Huckleberry Finn by Mark Twain (76 & 19640) > 12 Alice's Adventures in Wonderland by Lewis Carroll (11, 19573 & 928) > > (I know this is 12, but I couldn't bear to leave out Alice and Huck.) > > (Caveat: I was unable to get precise 30 day intervals, so this list is > an approximation. A /very good/ approximation, but an approximation > nonetheless.) > > (Caveat bis: These data are derived from that reported on the PG web > site. They are only as good as PG's reporting.) > > Of course, because the PG corpus is always growing, this kind of linear > analysis may over-weight early downloads. So I changed the collation > algorithm a bit. I started with a 6 month baseline, and then as I added > each 30 day list I increased the weighting by 4%. That is, the data as > of Feb. 2005 was counted at 100%, but the data from Feb-Mar was counted > at 104%, the data from Mar-Apr was counted at 108%, the data from > Apr-May was counted at 112%, etc. Thus, more recent downloads got > counted more heavily that more distant downloads. > > So what is the adjusted top ten list? > > 1 The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci > (5000) > 2 The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (9551 & > 1661) > 3 The Art of War by Sun Tzu (132, 17405 & 20594) > 4 Le Kamasutra by Vatsyayana (14609) > 5 Pride and Prejudice by Jane Austen (1342 & 20686) > 6 Manual of Surgery by Alexander Miles and Alexis Thomson (17921) > 7 How to Speak and Write Correctly by Joseph Devlin (6409) > 8 Ulysses by James Joyce (4300) > 9 Little Journeys to the Homes of the Great - Volume 01 of 14 by Elbert > Hubbard (12933) > 10 The War of the Worlds by H. G. Wells (36 & 8976) > 11 The Adventures of Huckleberry Finn by Mark Twain (76 & 19640) > > As you can see, the Manual of Surgery is more popular recently, and Hand > Shadows less so. Alice dropped to 14, so I didn't feel like I could > include her. What is interesting is that the addition of new files to > the PG corpus has not had much affect on the most popular file downloads. > > The data for all 400+ files can be found at > http://www.passkeysoft.com/~lee/zero.txt and > http://www.passkeysoft.com/~lee/four.txt. > > Bowerbird, if you want to know where to start in your conversion process > to z.m.l., I would suggest the books on this list. > > I hand manipulated the first 50 entries in each file, to try to count > multiple editions of the same book as a single entry, the remaining data > is raw. > > Enjoy! > > -- > Nothing of significance below this line. 
> > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From lee at novomail.net Thu Nov 1 16:36:41 2007 From: lee at novomail.net (Lee Passey) Date: Thu, 01 Nov 2007 16:36:41 -0700 Subject: [gutvol-d] !@!Re: More useless data In-Reply-To: References: <472A4CE4.4020704@novomail.net> Message-ID: <472A6309.9020502@novomail.net> Michael Hart wrote: > > With your permission, I'd like to include your files, > and perhaps this report, in a Project Gutenberg file. No need to ask for permission; anything I post to a public forum I always dedicate to the public domain. -- Nothing of significance below this line. From marcello at perathoner.de Thu Nov 1 15:57:04 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu, 01 Nov 2007 23:57:04 +0100 Subject: [gutvol-d] More useless data In-Reply-To: <472A4CE4.4020704@novomail.net> References: <472A4CE4.4020704@novomail.net> Message-ID: <472A59C0.1090809@perathoner.de> Lee Passey wrote: > And the most popular download during the past three years is: > > (drumroll please) > > The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci > (etext 5000) The "Notebooks" were featured in a prominent blog in a "read a page a day" series. Probably the same couple hundred people accessed the whole book every day just to read the "page of the day". This went on for almost 2 years. The "Manual of Surgery" gets requested almost exclusively referring to a google-image search containing the words "penis enlargement". But you cleanly missed the *5 most popular PG downloads of all times!!!* (Because I filter them or they would have topped the list every day from day one to eternity). Try to guess ... Give up? Scroll down ... Drum roll !!! #6557 The Fall of the House of Usher.mp3 #9695 Bleak House by Charles Dickens.mp3 #6550 The House of Mapuhi by Jack London.mp3 #9280 House of Mirth by Edith Wharton.mp3 #9714 A House to Let by Charles Dickens.mp3 Explaining the rationale behind the filtering is left as an exercise to Lee. -- Marcello Perathoner webmaster at gutenberg.org From hart at pglaf.org Thu Nov 1 18:30:50 2007 From: hart at pglaf.org (Michael Hart) Date: Thu, 1 Nov 2007 18:30:50 -0700 (PDT) Subject: [gutvol-d] More useless data In-Reply-To: <472A59C0.1090809@perathoner.de> References: <472A4CE4.4020704@novomail.net> <472A59C0.1090809@perathoner.de> Message-ID: And we leave the rationale that all five on your list contain: "House" to whom? On Thu, 1 Nov 2007, Marcello Perathoner wrote: > Lee Passey wrote: > >> And the most popular download during the past three years is: >> >> (drumroll please) >> >> The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci >> (etext 5000) > > The "Notebooks" were featured in a prominent blog in a "read a page a > day" series. Probably the same couple hundred people accessed the whole > book every day just to read the "page of the day". This went on for > almost 2 years. > > The "Manual of Surgery" gets requested almost exclusively referring to a > google-image search containing the words "penis enlargement". > > > But you cleanly missed the > > *5 most popular PG downloads of all times!!!* > > (Because I filter them or they would have topped the list every day from > day one to eternity). > > Try to guess ... > > Give up? > > Scroll down ... > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Drum roll !!! 
> > > #6557 The Fall of the House of Usher.mp3 > #9695 Bleak House by Charles Dickens.mp3 > #6550 The House of Mapuhi by Jack London.mp3 > #9280 House of Mirth by Edith Wharton.mp3 > #9714 A House to Let by Charles Dickens.mp3 > > > Explaining the rationale behind the filtering is left as an exercise to Lee. > > > > -- > Marcello Perathoner > webmaster at gutenberg.org > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From j.hagerson at comcast.net Thu Nov 1 19:08:46 2007 From: j.hagerson at comcast.net (John Hagerson) Date: Thu, 1 Nov 2007 21:08:46 -0500 Subject: [gutvol-d] More useless data In-Reply-To: Message-ID: <003b01c81cf5$4d501bc0$1f12fea9@sarek> It appears that "House" is a musical genre of relatively recent vintage. These files may have been found by searching "house and mp3." I fear that many of the people who download these particular audio books are disappointed by what they contain. John Hagerson -----Original Message----- From: gutvol-d-bounces at lists.pglaf.org [mailto:gutvol-d-bounces at lists.pglaf.org] On Behalf Of Michael Hart Sent: Thursday, November 01, 2007 8:31 PM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] More useless data And we leave the rationale that all five on your list contain: "House" to whom? From Bowerbird at aol.com Fri Nov 2 10:20:42 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 2 Nov 2007 13:20:42 EDT Subject: [gutvol-d] on the importance of remixability Message-ID: kottke, the link blogger, interviews yochai benkler, the author of "the wealth of networks", over here: > http://www.kottke.org/07/11/yochai-benkler on why he made his book free online, benkler says: > But for me what was more important than simply > the freedom to download, was the freedom to > do things with the book. That's why I held out for > licensing the book under a CC noncommercial > sharealike license. The fact that people were able > to take the book and convert it into other formats, > including making readings of some portions; that > some people began to translate portions of the book; > these were the reasons that mattered. -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071102/6a2124bf/attachment.htm From Bowerbird at aol.com Fri Nov 2 11:10:01 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 2 Nov 2007 14:10:01 EDT Subject: [gutvol-d] your pudding sampler menu Message-ID: my tool-chain is starting to cohere across the entire workflow, so here's a reminder about the pudding samples available right now. all of these are in-progress, so constructive criticism is welcomed. give -- cross-platform viewer-program for z.m.l. (dated now, but...) > download from the "zml-talk" group at yahoogroups zandbox -- cross-platform z.m.l. authoring tool > backchannel me for a copy banana cream -- cross-platform proofreading engine > backchannel me for a copy babelfish -- prototype web-app viewer-program for z.m.l. 
> http://z-m-l.com/go/babelfish19.pl verylovely -- canned online zml-to-html conversion demo > http://www.z-m-l.com/go/vl3.pl zmldingus -- live online zml-to-html conversion app > http://www.z-m-l.com/go/zmldingus093.pl "continuous proofreading" mode: various sample books > http://z-m-l.com/go/myant/myantp001.html > http://z-m-l.com/go/mabie/mabiep001.html > http://z-m-l.com/go/tolbk/tolbkp001.html > http://z-m-l.com/go/sgfhb/sgfhbp001.html > http://z-m-l.com/go/ahmmw/ahmmwp001.html .pdf samples -- sample of the zml-to-pdf conversion process > http://z-m-l.com/oyayr/oyayr.zml > http://z-m-l.com/oyayr/oya-sunday.pdf > http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/alice01.zml > http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/alice01b.pdf .html samples -- sample of the zml-to-html conversion process > http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/alice01.zml > http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/alice01.html -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071102/caad43a5/attachment.htm From robert_marquardt at gmx.de Fri Nov 2 23:30:31 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Sat, 03 Nov 2007 07:30:31 +0100 Subject: [gutvol-d] your pudding sampler menu In-Reply-To: References: Message-ID: This message like a few others was put into the spam folder by my mail provider. Interestingly by a human designed filter. Very amusing :-) -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From lee at novomail.net Sat Nov 3 08:45:21 2007 From: lee at novomail.net (Lee Passey) Date: Sat, 03 Nov 2007 08:45:21 -0700 Subject: [gutvol-d] Diff tools Message-ID: <472C9791.9070600@novomail.net> I'm making good progress on my TEIification of Mark Twains _Puddn'head_Wilson_. What I want to do at this point is "diff" one or more versions, including a couple I have OCRed myself. What I /don't/ want to do is strip out markup before performing the diff. Is anyone aware of any tool I can use to diff two (or more) files without degrading or normalizing the text? For example, something that can compare an XHML file with an allegedly identical impoverished text file? From traverso at posso.dm.unipi.it Sat Nov 3 13:48:05 2007 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Sat, 3 Nov 2007 21:48:05 +0100 (CET) Subject: [gutvol-d] Diff tools In-Reply-To: <472C9791.9070600@novomail.net> (message from Lee Passey on Sat, 03 Nov 2007 08:45:21 -0700) References: <472C9791.9070600@novomail.net> Message-ID: <20071103204805.1ACF993B66@posso.dm.unipi.it> >>>>> "Lee" == Lee Passey writes: Lee> I'm making good progress on my TEIification of Mark Twains Lee> _Puddn'head_Wilson_. What I want to do at this point is Lee> "diff" one or more versions, including a couple I have OCRed Lee> myself. What I /don't/ want to do is strip out markup before Lee> performing the diff. Lee> Is anyone aware of any tool I can use to diff two (or more) Lee> files without degrading or normalizing the text? For example, Lee> something that can compare an XHML file with an allegedly Lee> identical impoverished text file? I fear that I do not understand. It isn't clear to me what you want to have as result: do you want a list of differences, including those originating from the markup? Or you want to build a version with markup including markup for the variants of the text? 
I personally would like to have the second, and I more or less know how I would build a tool to get from a TEI file and a TXT file a TEI file with the variants marked, with some manual tweaking necessary where the modifications cross other markup. The key ingredients would be wdiff and some code for diff analysis that I already have. Carlo From Bowerbird at aol.com Sat Nov 3 14:50:35 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 3 Nov 2007 17:50:35 EDT Subject: [gutvol-d] Diff tools Message-ID: c'mon, guys, .tei is a worldwide standard, so there just has to be _lots_and_lots_ of diff tools that will do whatever anyone wants... you're just not _looking_ hard enough... -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071103/b94742d0/attachment.htm From jeroen.mailinglist at bohol.ph Sat Nov 3 15:39:43 2007 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Sat, 03 Nov 2007 23:39:43 +0100 Subject: [gutvol-d] Diff tools In-Reply-To: <20071103204805.1ACF993B66@posso.dm.unipi.it> References: <472C9791.9070600@novomail.net> <20071103204805.1ACF993B66@posso.dm.unipi.it> Message-ID: <472CF8AF.1010907@bohol.ph> I am not aware of any open source tool that does this, but Beyond Compare allows you to specify filters to run before the compare itself, which you can use to filter out tags, etc., without modifying the files in question. This will allow you to find differences in the character sequences, while ignoring markup. It shouldn't be too much work to build similar functionality in one of the many open source alternatives available. Jeroen Carlo Traverso wrote: >>>>>> "Lee" == Lee Passey writes: >>>>>> > > Lee> I'm making good progress on my TEIification of Mark Twains > Lee> _Puddn'head_Wilson_. What I want to do at this point is > Lee> "diff" one or more versions, including a couple I have OCRed > Lee> myself. What I /don't/ want to do is strip out markup before > Lee> performing the diff. > > Lee> Is anyone aware of any tool I can use to diff two (or more) > Lee> files without degrading or normalizing the text? For example, > Lee> something that can compare an XHML file with an allegedly > Lee> identical impoverished text file? > > I fear that I do not understand. > > It isn't clear to me what you want to have as result: do you want a > list of differences, including those originating from the markup? > > Or you want to build a version with markup including markup for the > variants of the text? > > I personally would like to have the second, and I more or less know > how I would build a tool to get from a TEI file and a TXT file a TEI > file with the variants marked, with some manual tweaking necessary > where the modifications cross other markup. The key ingredients would > be wdiff and some code for diff analysis that I already have. 
> > Carlo > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > From marcello at perathoner.de Sat Nov 3 15:45:13 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 03 Nov 2007 23:45:13 +0100 Subject: [gutvol-d] Diff tools In-Reply-To: <472C9791.9070600@novomail.net> References: <472C9791.9070600@novomail.net> Message-ID: <472CF9F9.8060005@perathoner.de> Lee Passey wrote: > What I want to do at this point is "diff" one or > more versions, including a couple I have OCRed myself. What I /don't/ > want to do is strip out markup before performing the diff. Do you want to compare the text or the tagging? If you want to compare the text, I strongly advise to strip the markup before doing the diff. > Is anyone aware of any tool I can use to diff two (or more) files > without degrading or normalizing the text? For example, something that > can compare an XHML file with an allegedly identical impoverished text file? You get more than 2 megagoogles for "xml diff". I guess there might be something for you in it. -- Marcello Perathoner webmaster at gutenberg.org From lee at novomail.net Sun Nov 4 10:37:11 2007 From: lee at novomail.net (Lee Passey) Date: Sun, 04 Nov 2007 11:37:11 -0700 Subject: [gutvol-d] Diff tools In-Reply-To: <20071103204805.1ACF993B66@posso.dm.unipi.it> References: <472C9791.9070600@novomail.net> <20071103204805.1ACF993B66@posso.dm.unipi.it> Message-ID: <472E1157.7080004@novomail.net> Carlo Traverso wrote: [snip] > It isn't clear to me what you want to have as result: do you want a > list of differences, including those originating from the markup? > > Or you want to build a version with markup including markup for the > variants of the text? I'm afraid that I may have sacrificed clarity for brevity. What I am looking for is more the second than the first. Let me see if I can illustrate with a few use cases. I have a 24-bit full-color image of page 56 of a particular edition of Puddn'head. I take that scan and downsample it in various ways resulting in 10 additional images, which may be gray-scale or black and white using different threshold values. I ran all 11 images through ABBYY, with various degrees of success. In three of the 11 result files one word was mis-recognized all in the same way. In four of the 11 result files one word was mis-recognized in different ways. In three of the 11 result files one word was incorrectly characterized as italic. What I want is a process by which I can diff all the versions, and via a voting algorithm "fix" the errors (inserting a marker so a human can revisit the change later). In two of these cases surrounding markup is irrelevant, but in one of them it is significant. To be honest, I don't really think I'm going to find a tool that will do precisely what I want, but I'm hoping to find some components I can cobble together to get close. In a second use case, I have an OCRed version of Puddn'head from a scan set obtained from Google. I also have the Google OCR text, but the text has been saved without markup. I want a process whereby I can compare the Google OCR text to my OCR marked up text (and perhaps texts from other sources as well, such as the Internet Archive), giving me an output that I can use in an automated procedure to merge changes back into the /marked/ text. 
The key here is automation; removing the markup, normalizing the files, diffing the normalized files, and then relying on a human to search the marked up version to find where changes need to be made is not the desired outcome. > I personally would like to have the second, and I more or less know > how I would build a tool to get from a TEI file and a TXT file a TEI > file with the variants marked, with some manual tweaking necessary > where the modifications cross other markup. The key ingredients would > be wdiff and some code for diff analysis that I already have. I think this is very close to what I'm looking for. From lee at novomail.net Sun Nov 4 10:48:27 2007 From: lee at novomail.net (Lee Passey) Date: Sun, 04 Nov 2007 11:48:27 -0700 Subject: [gutvol-d] Diff tools In-Reply-To: <472CF9F9.8060005@perathoner.de> References: <472C9791.9070600@novomail.net> <472CF9F9.8060005@perathoner.de> Message-ID: <472E13FB.1050900@novomail.net> Marcello Perathoner wrote: [snip] > You get more than 2 megagoogles for "xml diff". I guess there might be > something for you in it. And therein lies the problem. The solution may lie in the 2 megagoogles, but I probably won't ever find it; there's just too many results. An alternative to the Google brute force method (which so far has already led me to "HTML Compare", which unfortunately is too UI oriented) is a more targeted approach by posting a message to a forum or two where there is a possibility that someone has already encountered a tool similar to that which I am seeking, and can give me a more direct pointer. Both approaches are useful, and usually complementary. From Bowerbird at aol.com Sun Nov 4 13:52:13 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 4 Nov 2007 16:52:13 EST Subject: [gutvol-d] Diff tools Message-ID: carlo said: > Lee> Is anyone aware of any tool I can use to diff two (or more) > Lee> files without degrading or normalizing the text? For example, > Lee> something that can compare an XHML file with an allegedly > Lee> identical impoverished text file? > > I fear that I do not understand. > > It isn't clear to me what you want to have as result: do you want a > list of differences, including those originating from the markup? > > Or you want to build a version with markup including markup for the > variants of the text? c'mon carlo, it's obvious what lee wants... he wants to use the "impoverished text file" to compare to -- and make corrections to -- the file he has already marked up. perhaps you could suggest to him that he should apply markup to the "impoverished text file" -- remember, it's _easy_ to do! -- after which he can do a straightforward comparison of the files... :+) (but for those of you who contemplate doing this in the future, make the corrections _first_, and only _then_ apply the markup.) -bowerbird p.s. i see lee has made more posts, so i might have to take him out of my spam folder for this thread, since this promises to be _very_ juicy... let's see if i can be strong enough to avoid this temptation! ;+) ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071104/1dd5aea2/attachment.htm From Bowerbird at aol.com Mon Nov 5 14:26:38 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 5 Nov 2007 17:26:38 EST Subject: [gutvol-d] give one, get a year Message-ID: thinking of doing that give-one-get-one on the o.l.p.c.? 
t-mobile just sweetened the deal for you, with a free year of hotspot wi-fi... > http://www.olpcnews.com/laptops/xo1/olpc_xo_sales_commitments.html -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071105/7d0d9737/attachment.htm From jon at noring.name Mon Nov 5 17:48:55 2007 From: jon at noring.name (Jon Noring) Date: Mon, 5 Nov 2007 18:48:55 -0700 Subject: [gutvol-d] [forward] ANN: P5 Version 1.0 of the TEI Guidelines has been released Message-ID: <541883530.20071105184855@noring.name> [Posted to TEI-L by Christian Wittern , Institute for Research in Humanities, Kyoto University. Forwarding it here for those interested. Jon] Dear TEI users, After more than 6 years of, at times quite intensive, development, it is with great pleasure that I announce the release of version 1.0 of P5, the latest and greatest version of the Guidelines of the Text Encoding Initiative, which officially happened Nov. 2 at the TEI Members Meeting in Maryland. You will find the new version online at http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html; a PDF and even printed books are expected to appear in due time. The main development work has been carried out by the TEI Technical Council and the Editors Lou Burnard and Syd Bauman. I would like to take this opportunity to warmly thank all previous members of the Council but also especially the current members, who shared quite a big of the work, which magically increased as the release date was approaching: David Birnbaum, Tone Merete Bruvik, Arianna Ciula, James Cummings, Matthew Driscoll, Daniel O'Donnel, Dot Porter, Sebastian Rahtz Laurent Romary, Conal Tuohy, John Walsh. Christian Wittern Chair, TEI Technical Council -- Christian Wittern Institute for Research in Humanities, Kyoto University 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN From piggy at netronome.com Tue Nov 6 10:09:35 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Tue, 06 Nov 2007 13:09:35 -0500 Subject: [gutvol-d] Diff tools In-Reply-To: <472E13FB.1050900@novomail.net> References: <472C9791.9070600@novomail.net> <472CF9F9.8060005@perathoner.de> <472E13FB.1050900@novomail.net> Message-ID: <4730ADDF.10909@netronome.com> Lee Passey wrote: > Marcello Perathoner wrote: > > [snip] > > >> You get more than 2 megagoogles for "xml diff". I guess there might be >> something for you in it. >> > > And therein lies the problem. The solution may lie in the 2 megagoogles, > but I probably won't ever find it; there's just too many results. > > An alternative to the Google brute force method (which so far has > already led me to "HTML Compare", which unfortunately is too UI > oriented) is a more targeted approach by posting a message to a forum or > two where there is a possibility that someone has already encountered a > tool similar to that which I am seeking, and can give me a more direct > pointer. > > Both approaches are useful, and usually complementary. > I see that ubuntu gutsy has xmldiff which sounds like it can solve the xml-xml problem. When I encounter the megagoogle problem (and what I'm looking for is not on the first page), I turn to clusty.com. (In the interests of full disclosure: I have several close friends who work for the company that runs clusty.com.) 
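For readers following this diff-tools thread: below is a minimal sketch, in Python, of the approach outlined above -- Marcello's "strip the markup before doing the diff," plus the map back into the markup that Carlo and Lee are after. It strips tags for comparison purposes only, but remembers, for every character of text that survives, its offset in the marked-up file, so each difference found against the "impoverished" text file can be located automatically in the original markup (the automated merge-back Lee asked for). The function names, the tag-stripping regex, and the report format are all illustrative; this is not an existing tool, and it handles only tags (no entities, comments, or CDATA).

    import difflib
    import re
    import sys

    TAG_RE = re.compile(r'<[^>]+>')

    def strip_markup_with_map(marked_up):
        """Return (plain, offsets): the text with tags removed, plus, for
        each character kept, its index in the original marked-up string."""
        plain, offsets = [], []
        last = 0
        for m in TAG_RE.finditer(marked_up):
            for i in range(last, m.start()):
                plain.append(marked_up[i])
                offsets.append(i)
            last = m.end()
        for i in range(last, len(marked_up)):
            plain.append(marked_up[i])
            offsets.append(i)
        return ''.join(plain), offsets

    def collapse_whitespace(text, offsets):
        """Collapse whitespace runs to single spaces so that line-wrapping
        differences are not reported as textual differences."""
        out, out_off, in_space = [], [], False
        for ch, off in zip(text, offsets):
            if ch.isspace():
                if not in_space:
                    out.append(' ')
                    out_off.append(off)
                in_space = True
            else:
                out.append(ch)
                out_off.append(off)
                in_space = False
        return ''.join(out), out_off

    def diff_against_plain(marked_up, plain_text):
        """Yield (change, offset_in_marked_up, old, new) tuples."""
        stripped, offsets = strip_markup_with_map(marked_up)
        stripped, offsets = collapse_whitespace(stripped, offsets)
        plain, _ = collapse_whitespace(plain_text, list(range(len(plain_text))))
        matcher = difflib.SequenceMatcher(None, stripped, plain, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == 'equal':
                continue
            # Point back into the *marked-up* file, where a merge would edit.
            where = offsets[i1] if i1 < len(offsets) else len(marked_up)
            yield tag, where, stripped[i1:i2], plain[j1:j2]

    if __name__ == '__main__':
        with open(sys.argv[1], encoding='utf-8') as f:
            marked_up = f.read()
        with open(sys.argv[2], encoding='utf-8') as f:
            plain_text = f.read()
        for change, where, old, new in diff_against_plain(marked_up, plain_text):
            print('%-7s at offset %d: %r -> %r' % (change, where, old, new))

Run it as, say, "python diffmap.py puddnhead.xhtml puddnhead.txt" (hypothetical filenames); each reported difference carries an offset into the marked-up file, which is what an automated merge, or a voting pass across several OCR outputs, would key on. A character-level SequenceMatcher is slow on a whole book; a word-level comparison in the spirit of wdiff scales better, but the offset-map idea is the same.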
From Bowerbird at aol.com Tue Nov 6 13:12:09 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 6 Nov 2007 16:12:09 EST Subject: [gutvol-d] Diff tools Message-ID: piggy said: > I see that ubuntu gutsy has xmldiff which > sounds like it can solve the xml-xml problem. oh geez, i thought the absence of messages meant that lee had gotten a solution. instead, it looks like it means _no_one_ has a solution for this relatively common task... and yes, i know that if you have to _explain_ a joke, that means it's not very _funny_... but i suppose _some_ people got the laugh when i made my earlier message, so i don't see the harm in explaining it to those who have no sense of humor... the xml/tei crowd loved to tell you, over the years, how -- because there's so many institutions using heavy markup --there are all kinds of tools now for dealing with it, and we could depend on open-source to create even more... but the fact of the matter is that, when they get around to actually digitizing a book, those tools seem to disappear... indeed, here is a _very_basic_ task -- comparing a new digitization to a previous one, to find the differences -- and nobody's stepping up to say "here's a tool to do it..." let alone _lots_ of people pointing to _lots_ of such tools. nobody is saying, "let me go and ask on another listserve, where i'm sure they'll give us some answers right away..." and folks, this is for an extremely straightforward e-text! > http://www.gutenberg.org/etext/102 indeed, i've even included it as one of my z.m.l. demo-files: > http://z-m-l.com/go/vl3.pl > http://www.z-m-l.com/go/vlpuddnhead.zml there's little in this e-text that requires even light-markup, let alone heavy-markup: chapter headings, epigraphs, and not a whole lot more, if i remember it correctly... sheesh... -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071106/fa92b7b2/attachment.htm From Bowerbird at aol.com Wed Nov 7 11:09:52 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 7 Nov 2007 14:09:52 EST Subject: [gutvol-d] p.o.d. of the entire catalog Message-ID: at some point in time down the line, i'll be able to offer p.o.d. of the entire p.g. catalog. as i have said before, i will direct part of the proceeds to the p.g. foundation and part to michael hart for his longstanding devotion. one question here now is about the i.s.b.n. numbers. (and yes, i know the "n" at the end of "i.s.b.n." stands for "number", so that "i.s.b.n. numbers" is redundant.) i.s.b.n. are much cheaper in big blocks than small ones. immensely so. (because they are a part of the system that's designed to impose a high cost of entry on any small publishers, to the benefit of the larger houses.) would p.g. be willing to pick up the cost of the i.s.b.n.? it'd likely be smartest to buy a block of 50,000 or so... what conditions, if any, would be imposed in return? and how long would an official decision on this take, from request to approval to the issuance of a check? please understand that this is _not_ asking for a "favor". i've asked for favors, like when i asked for web-space. i have no trouble discerning when i'm asking for a favor, or saying that's what i'm doing. but this is not that case. either way, it's not going to make any difference at all in the amount of money that p.g. 
ultimately receives, because if the decision is a "no", i'll absorb the cost, so the upshot is that it means it'll be that much longer until the project moves into the black and p.g. gets any cash. so either way, p.g. will underwrite it, directly or indirectly. so the only question is whether p.g. owns the i.s.b.n. or not, and thus has the ability to continue selling the publications once i go to heaven... -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071107/92e8e842/attachment.htm From hart at pglaf.org Wed Nov 7 12:13:31 2007 From: hart at pglaf.org (Michael Hart) Date: Wed, 7 Nov 2007 12:13:31 -0800 (PST) Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: References: Message-ID: What do 50,000 ISBN's cost??? mh On Wed, 7 Nov 2007, Bowerbird at aol.com wrote: > at some point in time down the line, i'll be able to offer > p.o.d. of the entire p.g. catalog. as i have said before, > i will direct part of the proceeds to the p.g. foundation > and part to michael hart for his longstanding devotion. > > one question here now is about the i.s.b.n. numbers. > (and yes, i know the "n" at the end of "i.s.b.n." stands > for "number", so that "i.s.b.n. numbers" is redundant.) > > i.s.b.n. are much cheaper in big blocks than small ones. > immensely so. (because they are a part of the system > that's designed to impose a high cost of entry on any > small publishers, to the benefit of the larger houses.) > > would p.g. be willing to pick up the cost of the i.s.b.n.? > it'd likely be smartest to buy a block of 50,000 or so... > what conditions, if any, would be imposed in return? > and how long would an official decision on this take, > from request to approval to the issuance of a check? > > please understand that this is _not_ asking for a "favor". > i've asked for favors, like when i asked for web-space. > i have no trouble discerning when i'm asking for a favor, > or saying that's what i'm doing. but this is not that case. > > either way, it's not going to make any difference at all > in the amount of money that p.g. ultimately receives, > because if the decision is a "no", i'll absorb the cost, so > the upshot is that it means it'll be that much longer until > the project moves into the black and p.g. gets any cash. > so either way, p.g. will underwrite it, directly or indirectly. > > so the only question is whether p.g. owns the i.s.b.n. or not, > and thus has the ability to continue selling the publications > once i go to heaven... > > -bowerbird > > > > ************************************** > See what's new at http://www.aol.com > From creeva at gmail.com Wed Nov 7 12:19:50 2007 From: creeva at gmail.com (Brent Gueth) Date: Wed, 7 Nov 2007 15:19:50 -0500 Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: References: Message-ID: <2510ddab0711071219h64ae8401x46ac424f4ad780f2@mail.gmail.com> According to ISBN.org $1,570.00 per 1000 plus 180 processing fee - that's the highest lot number I could find by purchasing online - bowerbird may have a cheaper lot for getting that large of a lot. On Nov 7, 2007 3:13 PM, Michael Hart wrote: > > What do 50,000 ISBN's cost??? > > mh > > On Wed, 7 Nov 2007, Bowerbird at aol.com wrote: > > > at some point in time down the line, i'll be able to offer > > p.o.d. of the entire p.g. catalog. 
as i have said before, > > i will direct part of the proceeds to the p.g. foundation > > and part to michael hart for his longstanding devotion. > > > > one question here now is about the i.s.b.n. numbers. > > (and yes, i know the "n" at the end of "i.s.b.n." stands > > for "number", so that "i.s.b.n. numbers" is redundant.) > > > > i.s.b.n. are much cheaper in big blocks than small ones. > > immensely so. (because they are a part of the system > > that's designed to impose a high cost of entry on any > > small publishers, to the benefit of the larger houses.) > > > > would p.g. be willing to pick up the cost of the i.s.b.n.? > > it'd likely be smartest to buy a block of 50,000 or so... > > what conditions, if any, would be imposed in return? > > and how long would an official decision on this take, > > from request to approval to the issuance of a check? > > > > please understand that this is _not_ asking for a "favor". > > i've asked for favors, like when i asked for web-space. > > i have no trouble discerning when i'm asking for a favor, > > or saying that's what i'm doing. but this is not that case. > > > > either way, it's not going to make any difference at all > > in the amount of money that p.g. ultimately receives, > > because if the decision is a "no", i'll absorb the cost, so > > the upshot is that it means it'll be that much longer until > > the project moves into the black and p.g. gets any cash. > > so either way, p.g. will underwrite it, directly or indirectly. > > > > so the only question is whether p.g. owns the i.s.b.n. or not, > > and thus has the ability to continue selling the publications > > once i go to heaven... > > > > -bowerbird > > > > > > > > ************************************** > > See what's new at http://www.aol.com > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071107/8c2fe8de/attachment.htm From creeva at gmail.com Wed Nov 7 12:21:41 2007 From: creeva at gmail.com (Brent Gueth) Date: Wed, 7 Nov 2007 15:21:41 -0500 Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: References: Message-ID: <2510ddab0711071221o37cf506etdeb4f5dee6b89904@mail.gmail.com> I got my information from this page in case I'm misreading it - I'll let you interpret the numbers yourself On Nov 7, 2007 3:13 PM, Michael Hart wrote: > > What do 50,000 ISBN's cost??? > > mh > > On Wed, 7 Nov 2007, Bowerbird at aol.com wrote: > > > at some point in time down the line, i'll be able to offer > > p.o.d. of the entire p.g. catalog. as i have said before, > > i will direct part of the proceeds to the p.g. foundation > > and part to michael hart for his longstanding devotion. > > > > one question here now is about the i.s.b.n. numbers. > > (and yes, i know the "n" at the end of "i.s.b.n." stands > > for "number", so that "i.s.b.n. numbers" is redundant.) > > > > i.s.b.n. are much cheaper in big blocks than small ones. > > immensely so. (because they are a part of the system > > that's designed to impose a high cost of entry on any > > small publishers, to the benefit of the larger houses.) > > > > would p.g. be willing to pick up the cost of the i.s.b.n.? > > it'd likely be smartest to buy a block of 50,000 or so... > > what conditions, if any, would be imposed in return? 
> > and how long would an official decision on this take, > > from request to approval to the issuance of a check? > > > > please understand that this is _not_ asking for a "favor". > > i've asked for favors, like when i asked for web-space. > > i have no trouble discerning when i'm asking for a favor, > > or saying that's what i'm doing. but this is not that case. > > > > either way, it's not going to make any difference at all > > in the amount of money that p.g. ultimately receives, > > because if the decision is a "no", i'll absorb the cost, so > > the upshot is that it means it'll be that much longer until > > the project moves into the black and p.g. gets any cash. > > so either way, p.g. will underwrite it, directly or indirectly. > > > > so the only question is whether p.g. owns the i.s.b.n. or not, > > and thus has the ability to continue selling the publications > > once i go to heaven... > > > > -bowerbird > > > > > > > > ************************************** > > See what's new at http://www.aol.com > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071107/2cc7e7b7/attachment.htm From creeva at gmail.com Wed Nov 7 12:45:38 2007 From: creeva at gmail.com (Brent Gueth) Date: Wed, 7 Nov 2007 15:45:38 -0500 Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: <2510ddab0711071219h64ae8401x46ac424f4ad780f2@mail.gmail.com> References: <2510ddab0711071219h64ae8401x46ac424f4ad780f2@mail.gmail.com> Message-ID: <2510ddab0711071245s19471a48p17725f57c9e242d2@mail.gmail.com> Since my last mail didn't include the link www.isbn.org has this as their price list - more concise then the link I meant to send anyways ISBN List of Services & Fees REGULAR PROCESSING (15 business days turnaround) ISBN price list (the categories below include the combined processing fees and registration fees): 10 ISBNs: $275.00 100 ISBNs: $995.00 1000 ISBNs: $1,750.00 ________________________________ PRIORITY PROCESSING (48 business hours turnaround) ISBN price list (the categories below include the combined processing fees and registration fees): 10 ISBNs: $375.00 100 ISBNs: $1,095.00 1000 ISBNs: $1,850.00 ________________________________ EXPRESS PROCESSING (24 business hours turnaround) ISBN price list (the categories below include the combined processing fees and registration fees): 10 ISBNs: $400.00 100 ISBNs: $1,120.00 1000 ISBNs: $1,875.00 ________________________________ SELECTING BAR CODES: The EAN-13 bar codes are the bar code translations for ISBNs. Most bookstores, distributors, and industry related sectors require EAN-13 bar codes on books and book type products. Bar code price list: 1-5 bar codes: $25 per bar code (i.e. 3 bar codes at $25 per unit will total $75) 6-10 bar codes: $23 per bar code (i.e. 6 bar codes at $23 per unit will total $138) 11-100 bar codes: $21 per bar code (i.e. 11 bar codes at $21 per unit will total $231) ________________________________ On Nov 7, 2007 3:19 PM, Brent Gueth wrote: > According to ISBN.org $1,570.00 per 1000 plus 180 processing fee - that's the highest lot number I could find by purchasing online - bowerbird may have a cheaper lot for getting that large of a lot. > > > > > > On Nov 7, 2007 3:13 PM, Michael Hart wrote: > > > > > What do 50,000 ISBN's cost??? 
> > > > mh > > > > On Wed, 7 Nov 2007, Bowerbird at aol.com wrote: > > > > > at some point in time down the line, i'll be able to offer > > > p.o.d. of the entire p.g. catalog. as i have said before, > > > i will direct part of the proceeds to the p.g. foundation > > > and part to michael hart for his longstanding devotion. > > > > > > one question here now is about the i.s.b.n. numbers. > > > (and yes, i know the "n" at the end of "i.s.b.n." stands > > > for "number", so that "i.s.b.n. numbers" is redundant.) > > > > > > i.s.b.n. are much cheaper in big blocks than small ones. > > > immensely so. (because they are a part of the system > > > that's designed to impose a high cost of entry on any > > > small publishers, to the benefit of the larger houses.) > > > > > > would p.g. be willing to pick up the cost of the i.s.b.n.? > > > it'd likely be smartest to buy a block of 50,000 or so... > > > what conditions, if any, would be imposed in return? > > > and how long would an official decision on this take, > > > from request to approval to the issuance of a check? > > > > > > please understand that this is _not_ asking for a "favor". > > > i've asked for favors, like when i asked for web-space. > > > i have no trouble discerning when i'm asking for a favor, > > > or saying that's what i'm doing. but this is not that case. > > > > > > either way, it's not going to make any difference at all > > > in the amount of money that p.g. ultimately receives, > > > because if the decision is a "no", i'll absorb the cost, so > > > the upshot is that it means it'll be that much longer until > > > the project moves into the black and p.g. gets any cash. > > > so either way, p.g. will underwrite it, directly or indirectly. > > > > > > so the only question is whether p.g. owns the i.s.b.n. or not, > > > and thus has the ability to continue selling the publications > > > once i go to heaven... > > > > > > -bowerbird > > > > > > > > > > > > ************************************** > > > See what's new at http://www.aol.com > > > > > _______________________________________________ > > gutvol-d mailing list > > gutvol-d at lists.pglaf.org > > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > > From jon at noring.name Wed Nov 7 12:54:58 2007 From: jon at noring.name (Jon Noring) Date: Wed, 7 Nov 2007 13:54:58 -0700 Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: References: Message-ID: <1123139035.20071107135458@noring.name> > What do 50,000 ISBN's cost??? Hmmm, hard to say. I went to the ISBN.org site, which sells ISBNs for the United States: http://www.isbn.org/standards/home/isbn/us/isbn-fees.asp The largest block shown there is 1000 ISBNs, for $1750. It is a sliding scale, so there is hope: order 10: $27.50 each order 100: $9.95 each order 1000: $1.75 each Nothing is said if someone wanted to order a much larger block of ISBNs, such as 50,000. But I think one can safely say it is unlikely Bowker will sell ISBN's for a lot less than $1.75 each. How much lower they'd go, I don't have a clue. Nor do I know if they'd give PGLAF a break. (If they don't go below $1.75 per ISBN, then 50,000 will sell for $87,500 -- gulp.) Btw, we have to understand that there will be very few orders, if any, for the vast majority of PG texts. So in some ways these obscure titles will be "subsidized" by the better selling titles. Anyway, if this amount of money is too much for PGLAF, Bowerbird has offered to buy the ISBNs (if I read what he said correctly.) 
Now if one can find a POD company willing to sell PG ebooks using some identifier other than ISBN, that would be a better way to go. I've always liked UUID as an identifier -- it's free and can be generated by anyone. ISBN is a terrible book identifier anyway, which I've written about in the past -- and even worse for ebooks.

Jon Noring

From Bowerbird at aol.com Wed Nov 7 15:17:31 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 7 Nov 2007 18:17:31 EST Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog Message-ID:

michael said: > What do 50,000 ISBN's cost???

i have no idea. it's been a long time since i cared, and got a block.

> 10 ISBNs: $275.00 > 100 ISBNs: $995.00 > 1000 ISBNs: $1,750.00

yeah, that's the pricing structure i remembered... :+) if you want 10, they're $27.50 each. if you want 100, they're $9.95 each. if you want 1000, they're $1.75 each. these are _numbers_, for crying out loud. there's very little _good_ reason why they should be "cheaper when you buy in bulk", they'd save a little in "administrative costs", sure, but there's very little good reason why those costs should be anything but trivial... no, this is simply the big publishing industry protecting itself from small competitors by raising the cost of entry as high as they can... so the price curve has a truly ridiculous slope. and wait til you see how cheap it is for 10,000!

-bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071107/6be2367b/attachment-0001.htm

From donovan at abs.net Wed Nov 7 15:30:31 2007 From: donovan at abs.net (D Garcia) Date: Wed, 7 Nov 2007 19:30:31 -0400 Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: References: Message-ID: <200711071830.32572.donovan@abs.net>

On Wednesday 07 November 2007 15:13, Michael Hart wrote: > What do 50,000 ISBN's cost??? > > mh

Why would PG want to buy ISBNs when so many of the titles already have one assigned?

From jon at noring.name Wed Nov 7 15:40:07 2007 From: jon at noring.name (Jon Noring) Date: Wed, 7 Nov 2007 16:40:07 -0700 Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: References: Message-ID: <1906343196.20071107164007@noring.name>

Bowerbird wrote: > no, this is simply the big publishing industry > protecting itself from small competitors by > raising the cost of entry as high as they can...

From those I've talked with in the publishing industry, most large U.S. publishers would prefer ISBN to be free, as it is in many other countries. ISBN pricing is a real sticking point. Interestingly, the high cost of ISBNs in the U.S. has led many larger publishers to reuse the same ISBN for different ebook formats of the same title (which is a no-no per the ISBN ISO Standard), and this has led to problems in the ebook retail market where the PDF, LIT and MobiPocket (to name three) versions of a title are rolled into the same ISBN. Imagine what would happen if the hard cover and soft cover print versions of a title were given the same ISBN number?

> so the price curve has a truly ridiculous slope. > > and wait til you see how cheap it is for 10,000!

Well, Bowker does not include a 10,000 option in its order form. But it is possible that for very large clients they will give a further per-number discount.
Maybe Greg or Michael, representing PGLAF, should give Bowker a call and see if there is a further discount for very big accounts, and maybe a special discount for PG being a non-profit. Until Greg or Michael does that, it is premature to say with any certainty what Bowker will charge for 50,000 ISBN numbers. (No doubt PG could negotiate with Bowker -- everything is negotiable -- but how much Bowker will discount for 50,000 is hard to predict.) Jon Noring From jon at noring.name Wed Nov 7 16:23:02 2007 From: jon at noring.name (Jon Noring) Date: Wed, 7 Nov 2007 17:23:02 -0700 Subject: [gutvol-d] my thoughts on ISBN Message-ID: <209164248.20071107172302@noring.name> This discussion of getting 50,000 ISBN numbers, and then the comment that PG is already assigning ISBNs (to what?, some titles?), brings up an interesting side topic. Wikipedia has good background summary of ISBN: http://en.wikipedia.org/wiki/Isbn It was developed in 1966 by British (paper) book sellers, well before the digital era. It is an ISO standard. Note that sellers developed it, not publishers. When developed, and until recently, ISBN was intended to be a "Manifestation" identifier. That is, it was intended to identify the particular "object" for sale -- it was NOT a title ("Expression") identifier. That's why ISBN has close ties to barcodes. It's more like a UPC code. So the hard cover of a title is given a different ISBN from the paperback edition, and from the large print edition, etc. And this is important for retailers who have to keep track of sales since they sell "objects", not "titles". To a retailer, if they can sell 10,000,000 books, they don't care what the titles are. That is, they sell *books*, not *titles*. And ISBN is a book identifier, not a title identifier. As the ebook era dawned (for the large publishers sort of began about 1999/2000), book publishers all of a sudden saw that a title may need to be cast into a number of formats, so all of a sudden the need for ISBNs for a single title substantially increased. Since the large publishers are still pretty frugal folk, several of them decided that an "ebook" is an "ebook" and simply assigned the same ISBN no matter the format. All of a sudden, many publishers are now using ISBN as an "Expression" identifier, which is a "no-no" per the ISO standard (but understandable given the high cost Bowker charges for ISBN.) This has forced ebook retailers to internally expand the ISBN to include the ebook format code, since otherwise how can they and their customers differentiate between the different format versions (e.g., PDF from LIT)? PG can certainly use an ISBN as a sort of "Expression" code, but it is non-standard. And if so, then why even use ISBN? Thus, IMHO PG should only concern itself with ISBN when there is a market need for it. Otherwise PG should stay away from ISBN like the plague. Since POD may require ISBN, that may be a need. But I'd arrange with a POD provider to see if they'd accept a "home grown" ID that may look like an ISBN but is not an ISBN -- maybe uses hexadecimal instead of decimal notation, or something else, so it won't "clash" with any valid ISBNs out there. (Hmmm, the ISO standard behind ISBN might actually have some odd extensions that are never used, but there to use...) 
Jon Noring From Bowerbird at aol.com Wed Nov 7 16:47:47 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 7 Nov 2007 19:47:47 EST Subject: [gutvol-d] =?iso-8859-1?q?!=40!_Re=3A=A0_p=2Eo=2Ed=2E_of_the_enti?= =?iso-8859-1?q?re_catalog?= Message-ID: donovan said: > Why would PG want to buy ISBNs when > so many of the titles already have one assigned? well, because that's what a publisher does when you republish a public-domain book, so the i.s.b.n. points to _your_ publication, and not to any of the _previous_ editions... more specifically, when these versions go in google's system, and an end-user clicks to get a printed copy, then i will get the order. also, _my_ publications will be "full-view", so we circumvent that ridiculous situation where a public-domain book is "locked up" by publishers to increase hard-copy sales. there are a lot of p.g. e-texts that've been "repurposed" to print. i'm _fine_ with that, right up until they put it in the "limited view" section of google print. so i'm fixing that... and giving people the option of supporting project gutenberg and michael hart with the purchase of a nicely-formatted hard-copy of their favorite books from project gutenberg, nice formatting they can't easily get otherwise. so that's why... but i thought all that would be fairly obvious. -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071107/698b3ab1/attachment.htm From lee at novomail.net Wed Nov 7 19:47:13 2007 From: lee at novomail.net (Lee Passey) Date: Wed, 07 Nov 2007 20:47:13 -0700 Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) In-Reply-To: <200711071830.32572.donovan@abs.net> References: <200711071830.32572.donovan@abs.net> Message-ID: <473286C1.9000309@novomail.net> D Garcia wrote: > Why would PG want to buy ISBNs when so many of the titles already have one > assigned? ISBNs were not invented until 1966. Because most of the PG corpus is in the Public Domain, and first published long before 1966, if a title /does/ have one or more ISBNs assigned it's one that was assigned in a subsequent printing by some publisher. As demonstrated by Mr. Perathoner's example of September 5, some of the most popular titles can have dozens, if not hundreds, of ISBNs, each assigned by a different publisher. Indeed, the ISBN is most useful in identifying a /publisher/ not a title or an author. In fact, the only real use I can see for an ISBN is so when a bookstore owner is running low on stock (or has a request for a rare book) s/he can go to Bowker's Books In Print, find the publisher, and call in another order. For someone outside the retail chain ISBNs are virtually useless. Most of the PG corpus probably did come from books that had ISBNs, and some are an amalgam of multiple books each having its own ISBN. Whatever these ISBNs were (if they existed at all), however, is lost in the mists of time. The prices mentioned here for ISBNs is if you obtain them from the U.S. ISBN agency, which is R.R.Bowker Co. Project Gutenberg is an international organization, so if it really wanted to obtain a block of ISBNs for its own use it makes sense to me to obtain them from a /non/ U.S. agency from which they are typically available at a /much/ reduced price (free, if some reports can be believed). 
It may be that the need for an ISBN comes from a POD provider, and there is no intent to ever register a title for inclusion in Books In Print. If the POD provider doesn't validate the ISBN, its possible to just make one up. Of course, it would be bad form to claim an ISBN that some other company has, or may have, the rights to use. However, ISBNs have an interesting property: the last digit in the ISBN is a checksum digit. Because this checksum digit is based on a calculation modulo 11, for every valid ISBN there are 10 /invalid/ ISBNs which differ only by the last, or checksum, digit. If PG wanted to create a number that could be used in place of an ISBN, without risk that it would ever collide with a real ISBN it would suffice to create a method of generating unique 9-digit numbers, compute the standard ISBN checksum, and then add 1 (or some other number less than 11) to the checksum before computing the modulus. You wouldn't be able to register the publication with Books In Print, but for all other uses it ought to be just fine. From robert_marquardt at gmx.de Wed Nov 7 23:41:47 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Thu, 08 Nov 2007 08:41:47 +0100 Subject: [gutvol-d] I want the imagemap extension for the Wiki Message-ID: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> My idea is to create an "Adventskalender" for the Christmas time. Here random example found by Google image search http://www.gedichte-garten.de/adventskalender/adventskalender.shtml A free Christmas or winter image and some numbers placed on it should do the trick. The numbers linked to books from the Christmas Bookshelf. Rigged up on Nov 31 and deleted on Dec 25. -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From marcello at perathoner.de Thu Nov 8 04:26:36 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu, 08 Nov 2007 13:26:36 +0100 Subject: [gutvol-d] I want the imagemap extension for the Wiki In-Reply-To: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> References: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> Message-ID: <4733007C.1050503@perathoner.de> Robert Marquardt wrote: > My idea is to create an "Adventskalender" for the Christmas time. > Here random example found by Google image search > http://www.gedichte-garten.de/adventskalender/adventskalender.shtml > > A free Christmas or winter image and some numbers placed on it should > do the trick. The numbers linked to books from the Christmas > Bookshelf. Rigged up on Nov 31 and deleted on Dec 25. We are in a quandary here: the current "supported" ImageMap extension is for MediaWiki 1.9+. We are still running MediaWiki 1.6.8 because ibiblio used to have PHP4. Since ibiblio switched to PHP5 somwhere in August I had no time to upgrade to the current version. All I can do in the short term is to install this outdated version: http://www.mediawiki.org/wiki/Extension:ImageMap_%28McNaught%29 How is that with you? -- Marcello Perathoner webmaster at gutenberg.org From robert_marquardt at gmx.de Thu Nov 8 06:38:58 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Thu, 08 Nov 2007 15:38:58 +0100 Subject: [gutvol-d] I want the imagemap extension for the Wiki In-Reply-To: <4733007C.1050503@perathoner.de> References: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> <4733007C.1050503@perathoner.de> Message-ID: On Thu, 08 Nov 2007 13:26:36 +0100, you wrote: >We are still running MediaWiki 1.6.8 because ibiblio used to have PHP4. 
>Since ibiblio switched to PHP5 somwhere in August I had no time to >upgrade to the current version. > >All I can do in the short term is to install this outdated version: > > http://www.mediawiki.org/wiki/Extension:ImageMap_%28McNaught%29 > > >How is that with you? I could not yet find out how the .map file works, but as long as we get it working it should be good enough. We can uninstall the extension at the end of the year. -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From nwolcott2ster at gmail.com Thu Nov 8 06:38:17 2007 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Thu, 8 Nov 2007 09:38:17 -0500 Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) References: <200711071830.32572.donovan@abs.net> <473286C1.9000309@novomail.net> Message-ID: <00a501c82215$0862a800$660fa8c0@atlanticbb.net> ISBN's are free in Canada. The application form is on their website. (google isbn canada). You do need a Canada address however for them to send you the ISBN's. Canada ISBN's are not searchable at Barnes and Noble for example, at least when I tried at their store fpr one (a real book) nothing came up. They may have been using Booksin Print which only has US ISBN's. nwolcott2 at post.harvard.edu ----- Original Message ----- From: "Lee Passey" To: "Project Gutenberg Volunteer Discussion" Sent: Wednesday, November 07, 2007 10:47 PM Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) > D Garcia wrote: > > > Why would PG want to buy ISBNs when so many of the titles already have one > > assigned? > > ISBNs were not invented until 1966. Because most of the PG corpus is in > the Public Domain, and first published long before 1966, if a title > /does/ have one or more ISBNs assigned it's one that was assigned in a > subsequent printing by some publisher. As demonstrated by Mr. > Perathoner's example of September 5, some of the most popular titles can > have dozens, if not hundreds, of ISBNs, each assigned by a different > publisher. > > Indeed, the ISBN is most useful in identifying a /publisher/ not a title > or an author. In fact, the only real use I can see for an ISBN is so > when a bookstore owner is running low on stock (or has a request for a > rare book) s/he can go to Bowker's Books In Print, find the publisher, > and call in another order. For someone outside the retail chain ISBNs > are virtually useless. > > Most of the PG corpus probably did come from books that had ISBNs, and > some are an amalgam of multiple books each having its own ISBN. Whatever > these ISBNs were (if they existed at all), however, is lost in the mists > of time. > > The prices mentioned here for ISBNs is if you obtain them from the U.S. > ISBN agency, which is R.R.Bowker Co. Project Gutenberg is an > international organization, so if it really wanted to obtain a block of > ISBNs for its own use it makes sense to me to obtain them from a /non/ > U.S. agency from which they are typically available at a /much/ reduced > price (free, if some reports can be believed). > > It may be that the need for an ISBN comes from a POD provider, and there > is no intent to ever register a title for inclusion in Books In Print. > If the POD provider doesn't validate the ISBN, its possible to just make > one up. > > Of course, it would be bad form to claim an ISBN that some other company > has, or may have, the rights to use. However, ISBNs have an interesting > property: the last digit in the ISBN is a checksum digit. 
Because this > checksum digit is based on a calculation modulo 11, for every valid ISBN > there are 10 /invalid/ ISBNs which differ only by the last, or checksum, > digit. > > If PG wanted to create a number that could be used in place of an ISBN, > without risk that it would ever collide with a real ISBN it would > suffice to create a method of generating unique 9-digit numbers, compute > the standard ISBN checksum, and then add 1 (or some other number less > than 11) to the checksum before computing the modulus. You wouldn't be > able to register the publication with Books In Print, but for all other > uses it ought to be just fine. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From creeva at gmail.com Thu Nov 8 07:10:35 2007 From: creeva at gmail.com (Brent Gueth) Date: Thu, 8 Nov 2007 10:10:35 -0500 Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) In-Reply-To: <00a501c82215$0862a800$660fa8c0@atlanticbb.net> References: <200711071830.32572.donovan@abs.net> <473286C1.9000309@novomail.net> <00a501c82215$0862a800$660fa8c0@atlanticbb.net> Message-ID: <2510ddab0711080710s424fe56dx98b8e5eeefcc614@mail.gmail.com> Beyond the Canadian suggestion - why don't we work with the creative commons folks to come up with an open format since I'm sure they are going to run into the same issue at some point. If a collaboration worked together for a .10 or .5 maintenance fee for each I'm sure there would be a large adoption for the open community to start registering more items if the barrier to entry was significantly lowered. On Nov 8, 2007 9:38 AM, Norm Wolcott wrote: > ISBN's are free in Canada. The application form is on their website. (google > isbn canada). You do need a Canada address however for them to send you the > ISBN's. Canada ISBN's are not searchable at Barnes and Noble for example, at > least when I tried at their store fpr one (a real book) nothing came up. > They may have been using Booksin Print which only has US ISBN's. > > > nwolcott2 at post.harvard.edu > > ----- Original Message ----- > From: "Lee Passey" > To: "Project Gutenberg Volunteer Discussion" > Sent: Wednesday, November 07, 2007 10:47 PM > Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) > > > > D Garcia wrote: > > > > > Why would PG want to buy ISBNs when so many of the titles already have > one > > > assigned? > > > > ISBNs were not invented until 1966. Because most of the PG corpus is in > > the Public Domain, and first published long before 1966, if a title > > /does/ have one or more ISBNs assigned it's one that was assigned in a > > subsequent printing by some publisher. As demonstrated by Mr. > > Perathoner's example of September 5, some of the most popular titles can > > have dozens, if not hundreds, of ISBNs, each assigned by a different > > publisher. > > > > Indeed, the ISBN is most useful in identifying a /publisher/ not a title > > or an author. In fact, the only real use I can see for an ISBN is so > > when a bookstore owner is running low on stock (or has a request for a > > rare book) s/he can go to Bowker's Books In Print, find the publisher, > > and call in another order. For someone outside the retail chain ISBNs > > are virtually useless. > > > > Most of the PG corpus probably did come from books that had ISBNs, and > > some are an amalgam of multiple books each having its own ISBN. 
Whatever > > these ISBNs were (if they existed at all), however, is lost in the mists > > of time. > > > > The prices mentioned here for ISBNs is if you obtain them from the U.S. > > ISBN agency, which is R.R.Bowker Co. Project Gutenberg is an > > international organization, so if it really wanted to obtain a block of > > ISBNs for its own use it makes sense to me to obtain them from a /non/ > > U.S. agency from which they are typically available at a /much/ reduced > > price (free, if some reports can be believed). > > > > It may be that the need for an ISBN comes from a POD provider, and there > > is no intent to ever register a title for inclusion in Books In Print. > > If the POD provider doesn't validate the ISBN, its possible to just make > > one up. > > > > Of course, it would be bad form to claim an ISBN that some other company > > has, or may have, the rights to use. However, ISBNs have an interesting > > property: the last digit in the ISBN is a checksum digit. Because this > > checksum digit is based on a calculation modulo 11, for every valid ISBN > > there are 10 /invalid/ ISBNs which differ only by the last, or checksum, > > digit. > > > > If PG wanted to create a number that could be used in place of an ISBN, > > without risk that it would ever collide with a real ISBN it would > > suffice to create a method of generating unique 9-digit numbers, compute > > the standard ISBN checksum, and then add 1 (or some other number less > > than 11) to the checksum before computing the modulus. You wouldn't be > > able to register the publication with Books In Print, but for all other > > uses it ought to be just fine. > > _______________________________________________ > > gutvol-d mailing list > > gutvol-d at lists.pglaf.org > > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Thu Nov 8 15:10:35 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 8 Nov 2007 18:10:35 EST Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) Message-ID: brent said: > why don't we work with the creative commons folks > to come up with an open format yeah! fight the power! up with the people! right on! :+) listen, folks, i'm glad i got you all "musing" and everything, but i should've just asked michael backchannel about this... i will almost certainly need isbn's -- real ones, the u.s. kind, which will have to be purchased from the bowker b*st*rds -- because my guess is that's what google requires these days... (and objective number 1 is the google system, so we _know_ that people are informed they can read these books for free.) and even if google doesn't, then the p.o.d. place i use might. (because objective number 2 is giving people pretty output.) and bookstores absolutely do. not that i intend to put books in bookstores, but i'm not gonna turn down any orders either. so bowker's books-in-print is one target. and so is amazon. but, you know, best of luck with that whole revolution thing. no longer will we allow the i.s.b.n. to step down on our neck! -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071108/59a9d87d/attachment.htm From jon at noring.name Thu Nov 8 15:16:22 2007 From: jon at noring.name (Jon Noring) Date: Thu, 8 Nov 2007 16:16:22 -0700 Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) In-Reply-To: References: Message-ID: <1488200996.20071108161622@noring.name> Bowerbird said: > listen, folks, i'm glad i got you all "musing" and everything, > but i should've just asked michael backchannel about this... Probably. And definitely Greg or Michael needs to call the Bowker folk to get their pricing for 50,000 ISBNs. My "musings" were to aid in understanding the role and alternatives to ISBN, but I also noted the pragmatic reality of getting U.S. ISBNs. Jon From gutenberg at gagravarr.org Sat Nov 10 10:23:58 2007 From: gutenberg at gagravarr.org (Nick Burch) Date: Sat, 10 Nov 2007 18:23:58 +0000 (GMT) Subject: [gutvol-d] UK based volunteer to scan a few books? Message-ID: Hi All I hope this is the right volunteer list to post on for this... I've got 7 books from the late 19th century, which I've checked and are out of copyright, and seem interesting enough to contribute to the project. However, I don't have a scanner. Is there a volunteer in the UK who'd be interested in scanning them in, if I were to post the books to them? (I'll happily pay for postage) Nick From desrod at gnu-designs.com Sat Nov 10 17:08:57 2007 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Sat, 10 Nov 2007 20:08:57 -0500 Subject: [gutvol-d] "Turning the Pages of an eBook - Realistic Electronic Books" Message-ID: <1194743337.6413.2.camel@localhost.localdomain> I just found this Google video on YouTube, and found some of the items discussed (as well as all the eye-candy demos), to be quite interesting, especially with regard to our recent discussions about digitizing ebooks in a way that represents the "real" book structure. http://www.youtube.com/watch?v=9Y-BM3Z5xy0 -- David A. Desrosiers desrod at gnu-designs.com setuid at gmail.com http://projects.plkr.org/ Skype...: 860-967-3820 From ajhaines at shaw.ca Sun Nov 11 14:49:02 2007 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sun, 11 Nov 2007 14:49:02 -0800 Subject: [gutvol-d] Multi-volume book set with master index Message-ID: <000501c824b5$0e565ac0$6401a8c0@ahainesp2400> I'm working on a 4-volume set of books. The set's master index is in volume 4. Which is preferred: - to also include the index in volumes 1-3 of the set (for readers' convenience), or - to leave those volumes as they are? Regards, Al From ralf at ark.in-berlin.de Mon Nov 12 01:28:58 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 12 Nov 2007 10:28:58 +0100 Subject: [gutvol-d] Multi-volume book set with master index In-Reply-To: <000501c824b5$0e565ac0$6401a8c0@ahainesp2400> References: <000501c824b5$0e565ac0$6401a8c0@ahainesp2400> Message-ID: <20071112092858.GA28414@ark.in-berlin.de> > Which is preferred: > > - to also include the index in volumes 1-3 of the set (for readers' > convenience), or > - to leave those volumes as they are? I have the same problem somewhere on the horizon and I settled my plans with doing all four without index and a fifth complete with index edition. YMMV. 
ralf From gbnewby at pglaf.org Mon Nov 12 08:16:52 2007 From: gbnewby at pglaf.org (Greg Newby) Date: Mon, 12 Nov 2007 08:16:52 -0800 Subject: [gutvol-d] Multi-volume book set with master index In-Reply-To: <000501c824b5$0e565ac0$6401a8c0@ahainesp2400> References: <000501c824b5$0e565ac0$6401a8c0@ahainesp2400> Message-ID: <20071112161652.GB6326@mail.pglaf.org> On Sun, Nov 11, 2007 at 02:49:02PM -0800, Al Haines (shaw) wrote: > I'm working on a 4-volume set of books. The set's master index is in volume > 4. > > Which is preferred: > > - to also include the index in volumes 1-3 of the set (for readers' > convenience), or > - to leave those volumes as they are? > Hi, Al. It's definitely up to you. If the index will be live (that is, hyperlinks into the right locations in the HTML documents for the different volumes), it will be challenging to set up the links to external files (because you won't know the eBook #). We can pre-assign a set of eBook #s, but even so that's not so user-friendly (since people could rename after download). If it's not live/linked, then this is less of an issue. To me, having a duplication of the index, with references to each separate volume, would be slightly more user-friendly at the expense of making the individual volumes' files larger. -- Greg From jon at noring.name Tue Nov 13 10:42:28 2007 From: jon at noring.name (Jon Noring) Date: Tue, 13 Nov 2007 11:42:28 -0700 Subject: [gutvol-d] Announcing: The Digital Text Community mailing list Message-ID: <111525527.20071113114228@noring.name> Everyone, I am announcing the start of "The Digital Text Community", a public mailing list (on YahooGroups) devoted to serious discussion of digitizing "ink-on-paper" publications. The full group description is found at the group's "home page" at: http://groups.yahoo.com/group/digital-text/ The primary reason why I am starting DTC is that there is, suprisingly, no independent forum to discuss the various technical and non-technical issues of digitizing "ink-on-paper" publications. Current discussion on digitizing paper publications is disjointly spread around in various nooks and crannies of the Internet. For example, there are forums for particular digitization projects such as those run by Project Gutenberg (e.g. "gutvol-d") and Distributed Proofreaders (an online set of forums.) And then there are forums which touch upon various issues of text digitization but which is not their main focus. Examples are Book People (which John Mark Ockerbloom is closing the end of the month) and The eBook Community (a YahooGroup which I administer.) The summary purpose of DTC is given in the last paragraph of the DTC group description: "This group is not affiliated with any particular project or organization, but rather is independent. It is hoped this group will be a bridge between the various text digitization projects, enabling information exchange for everyone?s benefit." Do consider subscribing to DTC. If you need any help with subscribing to the group, let me know. Look forward to seeing you there! Jon Noring The Digital Text Community Administrator From Bowerbird at aol.com Thu Nov 15 09:08:34 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 15 Nov 2007 12:08:34 EST Subject: [gutvol-d] rumor-mongers Message-ID: the rumor-mongers have another "announce-date" for the kindle. (that's amazon's reader-machine, if you're out of the rumor loop.) this time it's monday, november 19th. could be. but i doubt it. i seriously doubt it. 
but nonetheless, because one guy says he is invited to an amazon press-conference, and he says it "might" be for the purpose of announcing the kindle... well, heck... that's all the good reason that the rumor-mongers need to rerun their tired speculation again. and so we have it. mobileread runs it: > http://www.mobileread.com/forums/showthread.php?t=16111 and the rothman teleblawg runs it: > http://www.teleread.org/blog/?p=7637 of course, they ran similar items back in september, promising that _october_ would be the due-date, and they also followed up on a n.y. times article (a retraction of the october prediction that moved it up to "end-year"), until i reminded them that only the most clueless of companies would release a niche gadget product _then_, after the _very_end_ of the year's big gift-buying season... and, believe me, amazon is _not_ a "clueless" company. and of course, they _also_ ran similar items back this _spring_, predicting that the release would be then. oh, and of course, they _also_ ran similar items _last_fall_ -- yes, they're now over a year late on their original predictions -- so maybe you should examine their track-record on this and decide you just don't have time for this kind of noise... for me, on the other hand, this stuff is wildly amusing... i don't know how i'd get through a week without having teleblawg speculation giving me laughs along the way... i definitely couldn't make this stuff up. -bowerbird p.s. and rothman, once again, even brought back his _$50_ pricepoint, this time talking about o.l.p.c., in an entry today. somehow, the rumors always seem sparkly and fresh-baked. it's amazing, isn't it? ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071115/6bb74589/attachment.htm From sly at victoria.tc.ca Fri Nov 16 14:55:14 2007 From: sly at victoria.tc.ca (Andrew Sly) Date: Fri, 16 Nov 2007 14:55:14 -0800 (PST) Subject: [gutvol-d] International Year of Languages Message-ID: I'm just going to throw out an idea here to see what people think. The U.N. General Assembly has declared 2008 to be the "International Year of Languages". What do fellow gutvol-d inhabitants think of the idea of having a day, or perhaps a week, where we try to have texts posted in as many languages as possible. This could mean "saving up" some of them, so as to have them all ready around the same time. I could envision that making a good press release. I also have ideas for different places and people I could go to, to encourage more participation in different langauges. This might be easier if I can say that it is to be done for a special day, or event. I've tried to see if there is one particular day, or time of year that would be most appropriate. I can find a number of schools having some kind of "World Language Day" in 2008, but they are all on different days. What might work best for the purposes of PG? Feedback? Andrew From piggy at netronome.com Fri Nov 16 19:35:02 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Fri, 16 Nov 2007 22:35:02 -0500 Subject: [gutvol-d] International Year of Languages In-Reply-To: References: Message-ID: <473E6166.9060601@netronome.com> Andrew Sly wrote: > I'm just going to throw out an idea here to see what people think. > > The U.N. General Assembly has declared 2008 to be the > "International Year of Languages". 
> > What do fellow gutvol-d inhabitants think of the idea of having > a day, or perhaps a week, where we try to have texts posted in > as many languages as possible. This could mean "saving up" some > of them, so as to have them all ready around the same time. > I think this is a delightful idea. I have a nice set of small Georgian books I've been meaning to put through DPEU. I also have an Enga book I think I can clear. > I could envision that making a good press release. > > I also have ideas for different places and people I could > go to, to encourage more participation in different langauges. > This might be easier if I can say that it is to be done for a > special day, or event. > > I've tried to see if there is one particular day, or time of > year that would be most appropriate. I can find a number of > schools having some kind of "World Language Day" in 2008, > but they are all on different days. What might work best for > the purposes of PG? > What about trying to pick a day for each language appropriate to that language? > Feedback? > > Andrew > From ricardofdiogo at gmail.com Sat Nov 17 12:46:46 2007 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Sat, 17 Nov 2007 20:46:46 +0000 Subject: [gutvol-d] International Year of Languages In-Reply-To: <473E6166.9060601@netronome.com> References: <473E6166.9060601@netronome.com> Message-ID: <9c6138c50711171246q5e9dc194r19781d79da1a7867@mail.gmail.com> 2007/11/17, La Monte H.P. Yarroll : > What about trying to pick a day for each language appropriate to that > language? Sounds great. For Portuguese it'd be June 10 (Day of Portugal, Camoes and the Portuguese Communities) and November 5 (Day of Portuguese Language in Brazil). Ricardo From Bowerbird at aol.com Mon Nov 19 10:42:27 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 19 Nov 2007 13:42:27 EST Subject: [gutvol-d] the game Message-ID: ok, the game just got a little more interesting: :+) > http://www.amazon.com/gp/product/B000FI73MA/ref=sa_menu_kdp3/103-5393010-8448654 i'm interested in the utility of this thing as a general web-browser -- how well will it work, and how will such use impact the costs? -- but in general i am impressed with this. and bezos _did_ get it out before thanksgiving -- with even 3 days to spare! -- so that's good. so then, let's see what the reviews are from the early buyers... -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071119/208a4134/attachment.htm From Bowerbird at aol.com Thu Nov 22 12:38:21 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 22 Nov 2007 15:38:21 EST Subject: [gutvol-d] happy thanksgiving Message-ID: have a happy thanksgiving, all! :+) including any native american "indians" out there! -bowerbird ************************************** Check out AOL's list of 2007's hottest products. (http://money.aol.com/special/hot-products-2007?NCID=aoltop00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071122/5e3175bc/attachment.htm From robert_marquardt at gmx.de Sun Nov 25 21:13:22 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Mon, 26 Nov 2007 06:13:22 +0100 Subject: [gutvol-d] I want the imagemap extension for the Wiki In-Reply-To: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> References: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> Message-ID: <77lkk3hobo9ttk6nk044hl674ag5qm0neu@4ax.com> On Thu, 08 Nov 2007 08:41:47 +0100, you wrote: >My idea is to create an "Adventskalender" for the Christmas time. >Here random example found by Google image search >http://www.gedichte-garten.de/adventskalender/adventskalender.shtml > >A free Christmas or winter image and some numbers placed on it should do the trick. The numbers linked to books from the >Christmas Bookshelf. Rigged up on Nov 31 and deleted on Dec 25. Marcelo has installed the extension, but now i am completely unable to do the work and time is running short. The calendar should be rigged up at Nov 31. Can i get some help? I asked Juliet Sutherland from DP to give me a list of 24 Christmas books. To complete the work we need a free picture. Best a simple winter landscape instead of such agressive Santa pictures. The numbers 1 to 24 have to be painted upon it with equal-sized boxes around it. This should be not too complicated. I fear i have to lean on Marcelo to create the imagemap. -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From ralf at ark.in-berlin.de Mon Nov 26 03:00:31 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 26 Nov 2007 12:00:31 +0100 Subject: [gutvol-d] I want the imagemap extension for the Wiki In-Reply-To: <77lkk3hobo9ttk6nk044hl674ag5qm0neu@4ax.com> References: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> <77lkk3hobo9ttk6nk044hl674ag5qm0neu@4ax.com> Message-ID: <20071126110031.GA6402@ark.in-berlin.de> You wrote > To complete the work we need a free picture. Best a simple winter landscape instead of such agressive Santa pictures. Take 24: http://commons.wikimedia.org/wiki/Winter ralf From robert_marquardt at gmx.de Mon Nov 26 21:34:32 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Tue, 27 Nov 2007 06:34:32 +0100 Subject: [gutvol-d] I want the imagemap extension for the Wiki In-Reply-To: <20071126110031.GA6402@ark.in-berlin.de> References: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> <77lkk3hobo9ttk6nk044hl674ag5qm0neu@4ax.com> <20071126110031.GA6402@ark.in-berlin.de> Message-ID: <92bnk39lhdl5cvqpqtni5vqps9kd5hnjab@4ax.com> On Mon, 26 Nov 2007 12:00:31 +0100, you wrote: >You wrote >> To complete the work we need a free picture. Best a simple winter landscape instead of such agressive Santa pictures. > >Take 24: > >http://commons.wikimedia.org/wiki/Winter > > >ralf Thanks, Landscape_in_Bavarian_in_wintertime.jpg is ideal. Maybe ifind the energy to do some work. -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From Bowerbird at aol.com Wed Nov 28 13:20:08 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 28 Nov 2007 16:20:08 EST Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend Message-ID: hope you all had a lovely long weekend... now it's back to work... :+) *** here's a long message, on many topics, conveniently combined here to minimize wear-and-tear on your delete-key finger... ;+) *** the challenge... 
as a way of presenting the challenge to myself, i've put up a graphic showing the opening page of 3 different editions of "alice in wonderland", all of them attained from the internet archive: > http://z-m-l.com/misc/thechallenge.png as you can see, the o.c.a. gets very good o.c.r. indeed, on some books, it's amazingly accurate. you can also see that, even still, it ain't perfect... the objective is to use the different versions to converge upon an _error-free_ version of each -- with as little human interaction as possible -- retaining linebreaks idiosyncratic to each edition. for example, load this into another browser-tab: > http://z-m-l.com/misc/thechallenge2.png and toggle between the two to show differences. i'll be making my own tool to accomplish this, but i welcome the efforts of other programmers too... perhaps then we could compare _our_ outputs to come to an even _more_ satisfying convergence... if you _are_ a programmer who'd like to take this on, you should also take a quick look at both of these files: > http://z-m-l.com/go/pap/pride%20and%20prejudice(4).txt > http://z-m-l.com/go/pap/pride_and_prejudice(4).html in addition, you might want to examine these demos: > http://snowy.arsc.alaska.edu/bowerbird/oneoo/oneoo-compweball.html > http://snowy.arsc.alaska.edu/bowerbird/oneoo/oneoo-compwebone.html which involved a similar comparison-across-editions. *** oh, and by the way, i've created a version of "alice" based on yet _another_ edition -- from google -- which you can peruse here, if you would care to: > http://z-m-l.com/go/aiwon/aiwonp001.html the scans are fairly crappy, actually _really_ crappy, but hey, sometimes you get what you pay for, right? and the o.c.r. -- as you'd imagine -- was atrocious, so i used the p.g. e-text as my base. what i found was that that file -- which i'd thought was _clean_ -- actually contains quite a bit of noise. some of that might be due to it coming from another edition, yes, (the british spellings, for sure); however, there were also a few outright _errors_. someone might want to clean up the p.g. file if they can find the source-text. not that i'm crabbing about "faithfulness", mind you, or "trustworthiness", or any of those other bogeymen; just saying there are errors that you might want to fix. actually, i've often thought that "alice" was the answer for why we didn't want to have p.g. e-texts adhere to a specific version, at least in our low-bandwidth past... the two footnotes that say "later editions added this" were a particularly adept way of handling that matter, given the alternative of mounting _another_ edition varying from the first only by the additional passages. indeed, you'll note that -- even in the edition i posted, which is fairly "faithful" to the 1898 edition's scans -- i included those two notes, as a worthwhile addition... anyway, there are errors in pg#11, if anyone cares... oh yeah, and my version isn't totally clean yet either. i made so many changes that it needs a second pass. plus, since i've changed the linebreaks, you will need a tool like the one i outlined above to do the job right. 
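For anyone tempted to take up the challenge, here is a minimal sketch (in Python, and not bowerbird's tool) of the comparison step: line up two OCR'd editions word by word and report the places where they disagree, so a human, or a third edition, can settle the vote. The filenames are hypothetical, and retaining each edition's own linebreaks would need the kind of markup bookkeeping discussed later in this file.

    # a minimal sketch of the cross-edition comparison described above:
    # list the spots where two OCR'd editions disagree at the word level
    import difflib
    import re

    def words(text):
        return re.findall(r"\S+", text)

    def disagreements(edition_a, edition_b):
        a, b = words(edition_a), words(edition_b)
        matcher = difflib.SequenceMatcher(None, a, b)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                yield " ".join(a[i1:i2]), " ".join(b[j1:j2])

    # hypothetical filenames for two editions of "alice"
    oca = open("alice-oca.txt").read()
    pg = open("alice-pg11.txt").read()
    for left, right in disagreements(oca, pg):
        print("%-30r  %r" % (left, right))   # e.g. 'nictures'  'pictures'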
but still, in the absence of anything better, it is there: > http://z-m-l.com/go/aiwon/aiwon.zml *** since thanksgiving has come and gone, i can report on this year's update to a prediction i made two years ago: > http://www.teleread.org/blog/2005/11/29/you-can-buy-the-mit-100-lapt op-for-200/ at that time, david rothman had been yammering on for several years about "the coming $50 e-book-machine"; when he repeated this ridiculousness yet another time, i finally decided i would call him on this little bit of spin. rothman responded with: > Folks, tune in a year from now, and we?ll see who?s right i bet him that it would be _five_years_ before his cheapo machine would be readily available, or i'd buy him lunch. he came back with a lot of blah-blah-blah about o.l.p.c. well, the first year came and went, with no such machine, not at $50, and not $100, and not $150, not even _$200_, which he'd predicted for 2005, indeed not at _any_ price... so he was eating crow for thanksgiving... and now the second year has come and gone, and _still_ no $50 machine, or $100 machine. the o.l.p.c. _is_ finally available for sale -- as a charity case (i gave, did you?) -- but it's a _$200_ machine, and you have to buy _2_ of 'em. (but you only get one, as the other is your "contribution"; i'm cool with that, since it's an _extremely_ good cause...) so david's "prediction"? 2 years late, with 100% over-run. meaning he was eating crow again for _this_ thanksgiving. and, of course, you know about the other machines now. the sony costs about $300, amazon's will run you $400, there are a few other contenders in that same range, and the iliad tops out all the prices at an ungodly $600-plus. if all this doesn't make you realize that a $50 prediction back in 2005 (or even going all the way back to _1992_, which is what rothman constantly likes to remind us) is _pure_folly_, then you, my friends, grasp reality poorly... an e-book-machine is a _computer_. it needs a _chip_ and a _screen_, which are the expensive elements of any computer. so you can't make a cheap e-book-machine. and when you can make an inexpensive e-book-machine, you'll be able to make an inexpensive _computer_ as well, and _no_one_ will want a limited-usage e-book-machine, not when they can get a full computer for the same price. so it's _mindless_spin_ to talk of cheap e-book-machines. anyway, this is pretty much what i was expecting all along. next year the o.l.p.c. (and its commercial rivals) will cost about $200 (without requiring you buy more than one)... the thanksgiving after that, the price'll be around $100, and the year after that -- 5 years from my original bet -- the price _might_ drop to as low as $50. (or might not...) rothman, completely wrong. bowerbird, completely right. and hey, the o.l.p.c. has done a _big_ favor to _everyone_. by issuing the mere _threat_ to create a low-price laptop, and having the crack mary lou jepsen make good on that threat by solving the way-too-expensive-screen problem (the e-ink greedsters thought that they had a monopoly), the commercial side has been forced to attend to the task. otherwise they would have put if off as long as they could... *** anyway, back to the digitization workbench... 
a supporter of the "epub" file-format digitized a copy of "woodcraft", an early classic environmentally-geared book: > http://www.zianet.com/jgray as i've said all along, i love it when people use that format, because it requires them to put in a whole bunch of work laying out the _structure_ of the book, so it's as easy as pie for me to then remix all of their work into z.m.l. pudding... so i did that. i thought it'd be a good exercise for my .pdf converter, so i ran that, making some improvements to it along the way. i decided to keep the linebreaks in the _text_ version of the file which i had obtained from the site listed above, which meant i had to use a pointsize that's fairly _small_. (more later, since that's a problem with p.g. e-texts too.) there was a _glossary_ in this edition, so i expanded the _footnote_ routines to handle glossary items as well, and that's a nice addition. it finds the terms automatically, so -- other than enclosing the words within [brackets] in the glossary section -- there's nothing else you'll need to do. (the routine finds the terms in the body-text all by itself, and creates the front-links and back-links automatically.) kinda nifty, if i do say so myself. indeed, i got kind of link-happy. first, i figured that i would create an _html_ version of the text as well. easy enough... then, i decided to have every page of the .pdf _link_up_to_ the .html version online, to demonstrate how you would do scholarly references in my z.m.l. cyberlibrary infrastructure. so each .pdf page (which represents _my_ "original" p-page) also links up to the online .html version, which _also_ mimics the "original" p-page, even displaying the "scan" next to it... so what we've got are _throughly_cross-linked_versions_ that are faithful (gawd, there's that stupid word again) "reprints" of the "original" p-book. this interlocking mesh makes me happy. there were also two references in the book to _other_ books, each time to a specific _page_ in that other book, so i linked those references in the .pdf to those _pages_ in those _books_, again demonstrating how scholarly references are accomplished. some people make a big deal out of such interbook linking, but i show that it's a very simple matter of straightforward execution. in addition, images are two-way-linked to the list of illustrations, and to the next-and-previous illustrations, _and_ to a full-page version of the illustration, plus to an online version of the image. more links than you can shake a stick at, and i wasn't done yet... jon noring has a reference i.d. for every _paragraph_ in his demo version of "my antonia", so i figured i had to match that capacity... and then i decided i'd do that one better, just to be interesting... so i included in the .pdf links to every _line_ in the .html version. but you cannot win this game if you only think one move ahead. so i decided instead that i would link to every _word_ in the .html. that's right, click on any _word_ in that .pdf, and your browser will jump to that exact _word_ in the canonical .html version online... and if noring finds it necessary, i'll link to every goshdarn _letter_. i've already coded the routine, jon. all i have to do is toggle a flag, and _boom_ it goes. don't make me push that button, jon... :+) anyway, all these .pdf links are created automatically, as per z.m.l. 
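a minimal sketch of that glossary idea, in python, as an illustration of the approach rather than the z.m.l. routine itself: terms marked [like this] in the glossary get a link from their first appearance in the body, and each glossary entry gets an anchor plus a back-link.

    import re

    def link_glossary(body_html, glossary_html):
        # terms are whatever the glossary encloses in [brackets]
        terms = re.findall(r"\[([^\]]+)\]", glossary_html)
        for term in terms:
            slug = re.sub(r"\W+", "-", term.lower())
            # front-link: wrap the first occurrence of the term in the body
            pat = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
            body_html, n = pat.subn(
                '<a id="body-%s" href="#gloss-%s">\\g<0></a>' % (slug, slug),
                body_html, count=1)
            # glossary entry: anchor, plus a back-link if the term was found
            back = ' <a href="#body-%s">[back]</a>' % slug if n else ""
            glossary_html = glossary_html.replace(
                "[%s]" % term,
                '<span id="gloss-%s">%s</span>%s' % (slug, term, back), 1)
        return body_html, glossary_html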
rework the parameters, as far as rewrapping or resizing the text, or changing the number of lines per page, and all of the links are automatically recomputed and recreated, without any intervention. after all, that's the kind of thing that computers are good at, right? *** ok, so here's where i admit that i was _wrong_. it doesn't happen too often, folks, because i'm not wrong very often, but when i am, i always admit it, and that's what i'm doing right now, so listen up. whenever i used to think about conversions _from_ z.m.l. format into other formats -- even ones like .html and .pdf which i clearly acknowledged as _useful_ ones -- i downplayed them in my mind. that's because my mission is to make those formats _unnecessary_. so offering a _conversion_ facility just seemed like a waste of time... thus, i put these converters way _way_ down on my list of priorities. indeed, the only reason they were on my list of priorities _at_all_ is because i figured i had to have parity with the heavy-markup crowd, (and -- at least early on -- the ability to convert to "any other format" was one of their big selling points. but they've backed off that now.) i assumed _some_ people would get some usage out of converters, but i never thought that _i_ would have much use for them, if any... however... now that i've got some excellent versions of these converters done, i realize that i was wrong, wrong, wrong. i am going to have _lots_ of use for these babies, yes i am, both the .html versions _and_ .pdf, and -- most especially, i realize now -- the _combination_ of the two! i've been able to imbue them with the same kind of super-navigation that i've always had in my z.m.l. viewer-program, so they are a _great_ way to demo that fantastic feature, so people will be able to get ideas about what z.m.l. means in practice. but also, with the ability to _link_ the versions to each other -- especially the .html version on the web, which serves as the "canonical" version for reference purposes -- i've attained a coherent synergistic package that will be very hard to beat. (just to give one example, i've always thought of annotations like this.) and with the ability of these formats to go places where my viewer-app might not run, i've basically got all the bases covered for my approach. which means that i'm gonna get lotsa mileage outta these converters... so, i was wrong, and i wouldn't have discovered that if i hadn't persisted in coding these converters, so i'd like to thank the people who made me think that these converters were "necessary" in some fashion or another. anyway, i'll be posting all of these files online in the next few days, and i'll let you know when they're available... *** here's the "secret diary of the amazon kindle": > november 18th -- businessweek cover-strory goes up on the web > november 19th -- press conference where jeff announces the thing > november 20th -- whoa! we've now sold all 36 units we had in stock! > november 21st -- place order for another 36 units, with a _rush_ on it. > november 22nd -- thanksgiving been berry berry good to us, yes sir... > november 23rd -- sold out again! place new order, doubling size (72). > november 24th -- back friday rocked! place _another_ double order! > november 25th -- ok, things are settling down, after the initial frenzy. > november 26th -- monday's are _always_ kinda slow with web orders... > november 27th -- maybe we can place a double order (72) tomorrow. > november 28th -- make it a single order (36) -- better safe than sorry. 
> november 29th -- we seem to have settled in at 9 orders per day. ok... > november 30th -- yep, another 9 orders today. (well, 8, but that's close.) > december 1st -- christmas is on the way, so let's gear up some hype, ok? > > 252 -- total units ordered > 240 -- total units sold > ---- > 012 -- units still held in stock that's just my little "funny" on the people who are saying "wow, the kindle is _sold_out_, so it _must_ be a success!" since we don't know how many units they had in the first place. (and it's ironic, because of all the _rumors_ about the kindle, i don't think _a_single_one_ ever mentioned a _production_run_.) but hey, even if the kindle turns out to be a complete bust, it won't "fail", as bezos has deep enough pockets to keep it around forever if he wants. nobody uses the "wiki" that is offered for every book on the amazon site, but amazon lets it hang around anyway. it'll be the same with the kindle. and perhaps even more importantly, the kindle _won't_ be "a complete bust". yeah, yeah, the d.r.m. stinks -- "defective by design", as the expression goes -- but ordinary people are amazingly tolerant of d.r.m. (until it bites their butt)... and yeah, yeah, there's a wide range of other problems with the kindle as well. but so what? _every_ e-book-machine that has gotten put into enough hands has managed to find a good number of fans. people _loved_ their rocketbooks. they loved their ipaqs. they now love their sony-readers, and love their iliads... and it's all for the exact same reasons that people love _paper_ books, because the love you feel for the _content_ slops over to the medium on which you read. so a good percentage of the people who _buy_ a kindle will _love_ their kindle... not matter _what_ any "critics" say. and that's the bottom line. *** so, really, a back-and-forth on the positives and negatives of the kindle is just the sound of a lot of people yacking... but -- to _my_ mind, anyway -- what _is_ an interesting is why did bezos announce-and-release this thing the way he did? i thought he would be smart enough to pre-announce it and use amazon's huge hype-machine to spin away all of the negatives before the machine was released, and to hype enough interest to make large crowds of buyers appear immediately. if he really wanted to sell this thing for christmas, he'd have announced it in june, and produced enough units that they were available in brick-and-mortar stores... as it was, this mid-november release was too little and too late, and it missed out on its chance for a long marketing campaign, and had to face criticism right away. all of this makes me think that amazon felt forced to announce before they wanted. and the reason _that_ makes me scratch my head is because i saw the _same_thing_ happen with the recent charity-angle sale of the o.l.p.c., which negroponte had been insisting for years that he wouldn't do. (and he had good reasons for that decision.) are these two premature releases related? might one of them have caused the other? if the o.l.p.c. machine proves to be a good book-reader, could it have been seen by amazon to be a "first mover" whom they needed to compete against? or vice versa? or... what if _both_ these premature releases were caused by another development? what if -- as some have speculated -- apple is announcing a tablet-mac in january? of even just a new paperback-sized ipod touch? (what a sweet e-reader that'd be!) if amazon and/or o.l.p.c. 
got wind of an upcoming tablet-mac, they might've thought they had to get _something_ out the door, and pronto, or be completely swept away... -bowerbird ************************************** Check out AOL's list of 2007's hottest products. (http://money.aol.com/special/hot-products-2007?NCID=aoltop00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071128/74649405/attachment-0001.htm From jon at noring.name Wed Nov 28 14:04:32 2007 From: jon at noring.name (Jon Noring) Date: Wed, 28 Nov 2007 15:04:32 -0700 Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend In-Reply-To: References: Message-ID: <1579433771.20071128150432@noring.name> Bowerbird wrote, in part: > jon noring has a reference i.d. for every _paragraph_ in his demo > version of "my antonia", so i figured i had to match that capacity... Was your motive to match that capacity because I did it, or because it's simply a good thing to do for the benefit of users? It's hard to tell if your motives are to show me up, or to benefit the end-user. Those who read your messages may get the impression you have one very large chip on your shoulder. > and then i decided i'd do that one better, just to be interesting... > so i included in the .pdf links to every _line_ in the .html version. This is also doable in XML since I mark the location of line breaks, an "id" can be added to those if desired. Or having "id" on all the major block-level stuff, one can use the formalism of XPointer to address right down to a letter in a word. > but you cannot win this game if you only think one move ahead. The winners here should be the users, not the developers. > so i decided instead that i would link to every _word_ in the .html. > that's right, click on any _word_ in that .pdf, and your browser will > jump to that exact _word_ in the canonical .html version online... > > and if noring finds it necessary, i'll link to every goshdarn _letter_. > i've already coded the routine, jon.? all i have to do is toggle a flag, > and _boom_ it goes.? don't make me push that button, jon...???? :+) Great work! The important thing is that you've come to realize, as I have been talking about for years, the importance of robust inter- and intra-publication linking. Glad to see you are implementing this in your system. Jon Noring From Bowerbird at aol.com Wed Nov 28 16:04:18 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 28 Nov 2007 19:04:18 EST Subject: [gutvol-d] the sad man still sitting back at the poker table Message-ID: after i'd spent a little time joking around with a few of the other players, cashed in my chips, and had a nice seafood meal in the casino restaurant (during which i tossed back more than a couple of glasses of champagne), i was leaving the joint when i spotted one lonely player still back at the table. he dealt some cards around, to the empty chairs, and then i heard him mutter, "i'll see your bet, and raise you a _new_ e-book listserve", as if the game were still on, and he had any chips left. i snorted out a big laugh, and hit the road... -bowerbird ************************************** Check out AOL's list of 2007's hottest products. (http://money.aol.com/special/hot-products-2007?NCID=aoltop00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071128/9e3b892d/attachment.htm

From lee at novomail.net Thu Nov 29 11:19:36 2007
From: lee at novomail.net (Lee Passey)
Date: Thu, 29 Nov 2007 12:19:36 -0700
Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend
In-Reply-To: 
References: 
Message-ID: <474F10C8.8050003@novomail.net>

Bowerbird at aol.com wrote:

> hope you all had a lovely long weekend...

Well, it was a bit frustrating for the very reasons you allude to below.

[snip]

> the objective is to use the different versions to
> converge upon an _error-free_ version of each
> -- with as little human interaction as possible --
> retaining linebreaks idiosyncratic to each edition.

[snip]

> i'll be making my own tool to accomplish this, but
> i welcome the efforts of other programmers too...
> perhaps then we could compare _our_ outputs to
> come to an even _more_ satisfying convergence...

I find that the whole problem space quickly gets very thorny. You see, the line breaks you want to retain constitute markup, as do little things like blank lines or indentation to represent paragraph breaks. So the problem becomes how to compare multiple versions of OCRed text without losing the markup.

My strategy has been to leverage the GNU diff program, which is quite sophisticated and quite powerful. diff, like all difference engines I am aware of, takes a line-oriented approach: it identifies lines of text which are different, in the context of other lines which are identical. So, in order to use a difference engine like diff (or Beyond Compare, for that matter) the texts to be compared need to be normalized so that, as much as possible, similar text begins similarly. Additionally, good normalization will allow differing text to be synchronized regularly. So the goal is to create normalized texts consisting of a number of lines which start in uniform locations, and which are relatively short, but not so short that a difference engine can't resync as needed.

The basic unit of language seems to me to be the sentence, so it makes sense that a good starting point would be to start each sentence on its own line. Now it's really hard for a computer program to figure out what /is/ a sentence without Natural Language Processing, so I decided to simply start a new line at the first whitespace following sentence-ending punctuation (.?!). This will sometimes cause lines to be broken in odd places (e.g. Dr., Mr., z.m.l. or e.g.) but creation of several smaller lines for comparison purposes is not really a drawback in this instance.

Of course, older texts, particularly 19th century texts, use extremely long sentences, so simply creating lines according to punctuation doesn't really create lines which are short enough for comparison purposes. So I chose, for no other reason than gut feeling, to also wrap lines at 50 characters, at whitespace delimiters.

My experience showed, however, that one of the most common OCR errors is in interpreting random defects in the paper as punctuation, or in miscounting the number of spaces between words. A single perceived (but not real) punctuation mark can throw off several lines of text. So I designed my word-wrapping function to not count whitespace and punctuation when determining a break point. I chose to preserve blank lines that would normally be wrapped otherwise, exclusively because it made it a little more convenient for me during development; there is no reason it should be necessary.
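(A quick sketch in Python of the normalization step described above; the function name and its details are illustrative guesses from the description, not the actual tool, and it skips the blank-line and markup handling:)

import re

def normalize(text, width=50):
    """Break text into short, comparable lines: start a new line at the
    first whitespace after sentence-ending punctuation, then wrap at
    roughly `width` characters, counting only word characters so that a
    stray OCR'd punctuation mark or extra space can't shift the breaks."""
    flat = ' '.join(text.split())                  # collapse existing breaks
    flat = re.sub(r'([.?!])\s+', r'\1\n', flat)    # sentence-ish boundaries
    out = []
    for sentence in flat.split('\n'):
        line, count = [], 0
        for word in sentence.split():
            line.append(word)
            count += sum(ch.isalnum() for ch in word)  # ignore punctuation/space
            if count >= width:
                out.append(' '.join(line))
                line, count = [], 0
        if line:
            out.append(' '.join(line))
    return '\n'.join(out)

Each OCR edition would be run through the same routine before being handed to diff; the markup that the collapse discards is what the separate data segment described below is meant to preserve.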
Thus, the first chapter of Alice in Wonderland from IA, normalized, would be (I have prefaced each line with '>' to try and prevent mail clients from wrapping the quotes):

> DOWN THE BABBIT-HOLE.
>
> ALICE was beginning to get very tired of sitting by her
> sister on the bank, and of having nothing to do: once or twice
> she had peeped into the book her sister was reading, but it had
> no nictures or conversations in it, "and what is
>
> 2 DOWN THE
>
> the use of a boot," thought Alice, "without pictures or
> conversations?"
>
> So she was considering in her own mind, (as well as she could, for
> the hot day made her feel very sleepy and stupid,) whether the
> pleasure of making a daisy - chain would be worth the trouble

This same passage from the 2003 Perathoner edition would be:

> Down the Rabbit-Hole
>
> Alice was beginning to get very tired of sitting by her
> sister on the bank, and of having nothing to do: once or twice
> she had peeped into the book her sister was reading, but it had
> no pictures or conversations in it, "and what is the use of a
> book," thought Alice "without pictures or conversation?"
>
> So she was considering in her own mind (as well as she could, for
> the hot day made her feel very sleepy and stupid), whether the
> pleasure of making a daisy-chain would be worth the trouble

As you can see, the two passages line up quite well. If the header/footer text can be extracted from the IA text the two passages would probably line up precisely.

The obvious problem with this normalization process is that important markup (for you line breaks, for me much more) is lost. My solution to this problem is thanks to Matt Russotto, who pointed out to me that markup can be stored segregated from its text. Thus, when normalizing any marked-up text, whenever markup is encountered you could record in a separate data segment the place where the markup occurs in the normalized text. For example, if your markup for a line break is "\n", page breaks are "\pg", and paragraphs are "\p", and you were normalizing the IA text of Alice you might have a data segment something like:

\n:1:21 \n:2:0 \p:3:0 \n:3:40 \n:4:33 \n:5:19 \n:6:2 \n:6:48 \n:7:0
\pg:8:0 \n:8:10 \n:9:0 \n:10:32 \n:11:15 \n:12:0 \p:13:0 \n:13:39

It should now be possible to "de-normalize" the normalized text by adding back in the markup and get a file identical to what you started with. (This is an important test and validation point; before continuing development, make sure that you can normalize and de-normalize files without data loss or change.)

Now, using the above two normalized passages from _Alice in Wonderland_, you should be able to use diff's patch capability to merge changes from one normalized text into the other normalized text, then use your de-normalize routine to add the markup back into the corrected text.

Not surprisingly, this "merge and de-normalize" process is much more complex than it sounds. As a trivial example, if the merge process causes lines to be added to or deleted from the master text, all of the markup locations stored in the data segment will become invalid. Likewise, if a change causes a word length to change (as in the infamous 'modem' vs. 'modern' scanno) the location of your line break is going to shift incorrectly.
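(Another rough Python sketch, this time of the markup-segregation and round-trip idea just described; the helper names and token handling are hypothetical, inferred from the "token:line:column" entries above rather than taken from any real tool:)

def strip_markup(lines, tokens=('\\pg', '\\p', '\\n')):
    """Remove inline markup tokens from normalized lines, recording each
    one as 'token:line:column' (column measured in the cleaned line)."""
    segment, cleaned = [], []
    for lineno, line in enumerate(lines, start=1):
        out, i = '', 0
        while i < len(line):
            for tok in tokens:                      # check '\pg' before '\p'
                if line.startswith(tok, i):
                    segment.append('%s:%d:%d' % (tok, lineno, len(out)))
                    i += len(tok)
                    break
            else:
                out += line[i]
                i += 1
        cleaned.append(out)
    return cleaned, segment

def restore_markup(cleaned, segment):
    """De-normalize: re-insert each recorded token. Working from the
    bottom right upward keeps earlier insertions from shifting the
    columns of the entries still to be applied."""
    lines = list(cleaned)
    entries = [e.rsplit(':', 2) for e in segment]
    entries.sort(key=lambda e: (int(e[1]), int(e[2])), reverse=True)
    for tok, lineno, col in entries:
        n, c = int(lineno) - 1, int(col)
        lines[n] = lines[n][:c] + tok + lines[n][c:]
    return lines

The validation test mentioned above then amounts to checking that restore_markup(*strip_markup(lines)) returns the original lines unchanged; the hard part, as the message goes on to say, is keeping the recorded positions valid once a merge starts adding, deleting, or resizing lines.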
I think that the "merge" process is the most complex and error-prone component of the total solution, and I don't currently know how it can be done completely reliably, but I do believe that this paradigm can be used to automate a large part of what is now a purely human effort.

From traverso at posso.dm.unipi.it Thu Nov 29 15:54:26 2007
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Fri, 30 Nov 2007 00:54:26 +0100 (CET)
Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend
In-Reply-To: <474F10C8.8050003@novomail.net> (message from Lee Passey on Thu, 29 Nov 2007 12:19:36 -0700)
References: <474F10C8.8050003@novomail.net>
Message-ID: <20071129235426.26FEB93B71@posso.dm.unipi.it>

Why don't you try wdiff? A lot can be done with it (or with mdiff, of which wdiff is a component).

Carlo

From Bowerbird at aol.com Thu Nov 29 18:13:37 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 29 Nov 2007 21:13:37 EST
Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend
Message-ID: 

carlo said:

> Why don't you try wdiff?

because it's easier for me to write my own program that will
produce better results than i could get out of wdiff?

far _far_ better results, as in not even a little bit close...

but maybe that's because _i_ don't know how to get wdiff
to best do what i specified. if you do, feel free to share it.
i'm sure people other than me will benefit from a tutorial.

but frankly, i'm quite skeptical wdiff can even _do_ the job.
let alone do it well. so go ahead, carlo, prove me wrong...

-bowerbird

p.s. by the way, this is the same mistake you all made at d.p.
with "wordcheck", i.e., having it depend on the aspell checker.
that dependence meant the programmer had to twist himself
into a pretzel, and _still_ ended up giving you inferior results
compared to what he'd have gotten programming that himself.
and i told you this, point blank, in advance. but evidently, this
was the type of info that was "damaging to your community..."
and yeah, maybe "your leaders are making stupid decisions"
_is_ a message that's too radical to let your minions be exposed to...
but, like i said, prove me wrong...

**************************************
Check out AOL's list of 2007's hottest products.
(http://money.aol.com/special/hot-products-2007?NCID=aoltop00030000000001)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071129/08b8ece0/attachment.htm

From lee at novomail.net Thu Nov 29 21:35:23 2007
From: lee at novomail.net (Lee Passey)
Date: Thu, 29 Nov 2007 22:35:23 -0700
Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend
In-Reply-To: <20071129235426.26FEB93B71@posso.dm.unipi.it>
References: <474F10C8.8050003@novomail.net> <20071129235426.26FEB93B71@posso.dm.unipi.it>
Message-ID: <474FA11B.9000809@novomail.net>

Carlo Traverso wrote:

> Why don't you try wdiff? A lot can be done with it (or with mdiff, of
> which wdiff is a component).

An interesting suggestion; what did you have in mind?

As you know, wdiff is a front-end to GNU diff which attempts to solve the very problem I mentioned at the beginning of my post: for diffs to be effective, the input files must be normalized. My approach to normalization was to try to force each sentence to begin on a new line, and to wrap sentences thereafter in short segments (approx. 50 characters).
wdiff's approach is to normalize the text by putting each /word/ on a separate line, and then making an attempt to reassemble the results into a usable format.

One of the wrinkles we face is the requirement Bowerbird established that markup must be retained throughout the process (a requirement which I believe is fundamental). I'm afraid I don't see how wdiff can be used while still meeting that requirement. My approach was to record markup separate from the raw text, with pointers back into the text. For this to work (and I'd welcome alternative suggestions), when changes get merged back into the "master" text (a fairly arbitrary selection, probably based on which version has retained the most markup) the pointers will probably need to be adjusted as corrections are made. Thus, I don't see how wdiff could be used to create a patch file (which might be edited by hand before use) which is then used to patch the master, and finally add the markup back in.

On the other hand, maybe the lesson from wdiff is not that the program itself could be used but that the approach could be used. Maybe the normalization process should create a file with "lines of words" which the "de-normalization" process could deal with more effectively. It's definitely something I'll experiment with, but if you have any suggestions as to how wdiff could be integrated into the process, please share them.

Remember, however, the two most fundamental requirements:

1. markup must be retained from beginning to end (although if it is removed in interim steps that's not a big deal), and

2. The process must be mostly automated; what I am trying to achieve is a mostly automated process which may require some slight human intervention, not a mostly manual process that is augmented by some slight machine-assistance.

From marcello at perathoner.de Thu Nov 29 23:00:04 2007
From: marcello at perathoner.de (Marcello Perathoner)
Date: Fri, 30 Nov 2007 08:00:04 +0100
Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend
In-Reply-To: <474FA11B.9000809@novomail.net>
References: <474F10C8.8050003@novomail.net> <20071129235426.26FEB93B71@posso.dm.unipi.it> <474FA11B.9000809@novomail.net>
Message-ID: <474FB4F4.3040607@perathoner.de>

Lee Passey wrote:

Given these two files:

>
> 'Tis the voice of the sluggard;
> I heard him complain,
> "You have waked me too soon,
> I must slumber again."
>

and this:

> 'Tis the voice of the Lobster; I heard him declare,
> 'You have baked me too brown, I must sugar my hair.'

what *exact* results do you expect from the diff?

--
Marcello Perathoner
webmaster at gutenberg.org

From robert_marquardt at gmx.de Fri Nov 30 01:31:40 2007
From: robert_marquardt at gmx.de (Robert Marquardt)
Date: Fri, 30 Nov 2007 10:31:40 +0100
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
Message-ID: 

Have a look at the tst version here:
http://www.gutenberg.org/wiki/User:Marcello/ImageMapTest
--
Robert Marquardt (Team JEDI) http://delphi-jedi.org

From klofstrom at gmail.com Fri Nov 30 03:22:42 2007
From: klofstrom at gmail.com (Karen Lofstrom)
Date: Fri, 30 Nov 2007 01:22:42 -1000
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
In-Reply-To: 
References: 
Message-ID: <1e8e65080711300322r35c60b59hed40ac52dc3fb4b5@mail.gmail.com>

On Nov 29, 2007 11:31 PM, Robert Marquardt wrote:

> Have a look at the tst version here

Perfect picture!

Myself, I'd prefer a different font for the numbers -- something serif or ornate. Perhaps a different color? Silver?
I'd put an ornate frame around the picture, and elaborate the lines dividing it into sections. Something more Victorian Christmassy.

Also, I'd like the numbers in the same position in each rectangle (bottom center?), but then I'm a stickler for symmetry. Feel free to ignore me as an outlier, unless others feel the same way.

The concept as a whole, however, is just fine. Thanks so much for working on it.

--
Karen Lofstrom

From johnson.leonard at gmail.com Fri Nov 30 03:36:26 2007
From: johnson.leonard at gmail.com (Leonard Johnson)
Date: Fri, 30 Nov 2007 06:36:26 -0500
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
In-Reply-To: 
References: 
Message-ID: <748ba8e50711300336g6066d752h6ba65f375fdeb3ed@mail.gmail.com>

On Nov 30, 2007 4:31 AM, Robert Marquardt wrote:

> Have a look at the tst version here:
> http://www.gutenberg.org/wiki/User:Marcello/ImageMapTest
> --
> Robert Marquardt (Team JEDI) http://delphi-jedi.org
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/listinfo.cgi/gutvol-d

I like it as is.

Is this going to remain on the user wiki? Is there a possibility for a link from the main page?

Len Johnson
--
http://members.cox.net/leaonarddjohnson/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071130/c4c29a94/attachment.htm

From robert_marquardt at gmx.de Fri Nov 30 06:14:56 2007
From: robert_marquardt at gmx.de (Robert Marquardt)
Date: Fri, 30 Nov 2007 15:14:56 +0100
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
In-Reply-To: <1e8e65080711300322r35c60b59hed40ac52dc3fb4b5@mail.gmail.com>
References: <1e8e65080711300322r35c60b59hed40ac52dc3fb4b5@mail.gmail.com>
Message-ID: <3c60l39oof53ho2b6p2vo20srrdbdjli6i@4ax.com>

On Fri, 30 Nov 2007 01:22:42 -1000, you wrote:

>Myself, I'd prefer a different font for the numbers -- something serif
>or ornate. Perhaps a different color? Silver? I'd put an ornate frame
>around the picture, and elaborate the lines dividing it into sections.
>Something more Victorian Christmassy.
>
>Also, I'd like the numbers in the same position in each rectangle
>(bottom center?), but then I'm a stickler for symmetry. Feel free to
>ignore me as an outlier, unless others feel the same way.

I had to ask for help because i am not able to do any work right now. I got this and accepted it as it is.

Yes, there are many ideas for the designs, but you could work on it for weeks and drown in all those designs.
--
Robert Marquardt (Team JEDI) http://delphi-jedi.org

From robert_marquardt at gmx.de Fri Nov 30 06:22:18 2007
From: robert_marquardt at gmx.de (Robert Marquardt)
Date: Fri, 30 Nov 2007 15:22:18 +0100
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
In-Reply-To: <748ba8e50711300336g6066d752h6ba65f375fdeb3ed@mail.gmail.com>
References: <748ba8e50711300336g6066d752h6ba65f375fdeb3ed@mail.gmail.com>
Message-ID: <3n60l3devcg49a88g6h5gu6m46gedtt5mb@4ax.com>

On Fri, 30 Nov 2007 06:36:26 -0500, you wrote:

>I like it as is.
>
>Is this going to remain on the user wiki? Is there a possibility for a link
>from the main page?

Of course. Just like the Christmas Bookshelf we promoted last year (and will promote this year also)
The SF CD promotion will be replaced tomorrow. Next year i think we should promote the Children bookshelves.

The Advent Calendar page will be removed on Dec 25. Next year we can create a new one.
I am not sure if we should do it again next year though.
Better do something new like a Christmas CD.
In two years maybe a calendar again, but with audio books. I will challenge Librivox for that.
--
Robert Marquardt (Team JEDI) http://delphi-jedi.org

From lee at novomail.net Fri Nov 30 09:12:48 2007
From: lee at novomail.net (Lee Passey)
Date: Fri, 30 Nov 2007 10:12:48 -0700
Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend
In-Reply-To: <474FB4F4.3040607@perathoner.de>
References: <474F10C8.8050003@novomail.net> <20071129235426.26FEB93B71@posso.dm.unipi.it> <474FA11B.9000809@novomail.net> <474FB4F4.3040607@perathoner.de>
Message-ID: <47504490.90600@novomail.net>

Marcello Perathoner wrote:

> Lee Passey wrote:
>
>
>
> Given these two files:
>
>>
>> 'Tis the voice of the sluggard;
>> I heard him complain,
>> "You have waked me too soon,
>> I must slumber again."
>>
>>
>
> and this:
>
>> 'Tis the voice of the Lobster; I heard him declare,
>> 'You have baked me too brown, I must sugar my hair.'
>>
>
> what *exact* results do you expect from the diff?
>

An excellent question. First, let me thank you for the example; it has helped me refine my own algorithms to be more precise.

Step one: normalize the two files. Using my current algorithm, creating lines of approx. 50 characters, you get:

[start poem.xml.norm]
'Tis the voice of the sluggard; I heard him complain, "You have
waked me too soon, I must slumber again."
[end poem.xml.norm]

and

[start poem.txt.norm]
'Tis the voice of the Lobster; I heard him declare, 'You have
baked me too brown, I must sugar my hair.'
[end poem.txt.norm]

Step two: compare the two normalized files. The resulting diff file is:

[start poem.diff]
1,2c1,9
< 'Tis the voice of the sluggard; I heard him complain, "You have
< waked me too soon, I must slumber again."
---
> 'Tis the voice of the Lobster; I heard him declare, 'You have
> baked me too brown, I must sugar my hair.'
>
>
>
>
>
>
>
[end poem.diff]

That was the easy part.

Step 3 is more complex: decide which of the two competing versions is the one you want in the result. The portion of the diff file that represents the markup can be discarded at this point. For a completely automated solution, you would want to repeat this process with other versions of the same text, and perhaps using a voting algorithm select the text which the majority of versions consider correct. Other options include considering one text as canonical, or actually having a human edit the diff file so that only desired changes remain. So far, this step is where I have expended the least amount of effort.

Step 4 is the hardest: merging accepted changes from the diff file back into the "master" file. Interestingly, your example is quite easy to merge back in. Assuming that all the changes from the text file are preferable, my current program yields:

[start newpoem.xml]
'Tis the voice of the Lobster; I heard him declare, 'You have
baked me too brown, I must sugar my hair.'
[end newpoem.xml]

What I am discovering is that the "de-normalization" program, which merges the changes and restores the markup, seems to be following the 80/20 rule: 80% of the cases can be solved fairly easily; the remaining 20% of the cases will require 4 times the effort. Actually, it's starting to look more like a 95/5 rule; the 5% of the changes which are anomalous seem to be highly intractable. Mr.
Traverso's suggestion to use word-based normalization may help solve these problems; but in some cases I'm afraid that the only solution may be to embed a milestone in the resulting output and require a human to resolve the discrepancy.

From Bowerbird at aol.com Fri Nov 30 13:38:09 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 30 Nov 2007 16:38:09 EST
Subject: [gutvol-d] hope you had a lovely holiday
Message-ID: 

oh gee, there's some activity in my spam folder.

do i open it up? or leave it be?

it's friday, the weekend!, so i do believe i'll ignore it.
maybe monday i'll look at it. or maybe not.

(if anyone wants to advise me to not even bother,
as it's worthless, those'll be welcome words to my ears.)

meanwhile, i'm still looking forward to carlo's tutorial.

-bowerbird

**************************************
Check out AOL's list of 2007's hottest products.
(http://money.aol.com/special/hot-products-2007?NCID=aoltop00030000000001)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071130/28ae7ade/attachment.htm

From piggy at netronome.com Fri Nov 30 15:40:25 2007
From: piggy at netronome.com (La Monte H.P. Yarroll)
Date: Fri, 30 Nov 2007 18:40:25 -0500
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
In-Reply-To: <3n60l3devcg49a88g6h5gu6m46gedtt5mb@4ax.com>
References: <748ba8e50711300336g6066d752h6ba65f375fdeb3ed@mail.gmail.com> <3n60l3devcg49a88g6h5gu6m46gedtt5mb@4ax.com>
Message-ID: <47509F69.20301@netronome.com>

Robert Marquardt wrote:
> On Fri, 30 Nov 2007 06:36:26 -0500, you wrote:
>
>> I like it as is.
>>
>> Is this going to remain on the user wiki? Is there a possibility for a link
>> from the main page?
>>
>
> Of course. Just like the Christmas Bookshelf we promoted last year (and will promote this year also)
> The SF CD promotion will be replaced tomorrow. Next year i think we should promote the Children bookshelves.
>
> The Advent Calendar page will be removed on Dec 25. Next year we can create a new one.
> I am not sure if we should do it again next year though. Better do something new like a Christmas CD.
> In two years maybe a calendar again, but with audio books. I will challenge Librivox for that.
>

Will the links be enabled separately day by day?

From robert_marquardt at gmx.de Fri Nov 30 21:58:46 2007
From: robert_marquardt at gmx.de (Robert Marquardt)
Date: Sat, 01 Dec 2007 06:58:46 +0100
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
In-Reply-To: <47509F69.20301@netronome.com>
References: <748ba8e50711300336g6066d752h6ba65f375fdeb3ed@mail.gmail.com> <3n60l3devcg49a88g6h5gu6m46gedtt5mb@4ax.com> <47509F69.20301@netronome.com>
Message-ID: <2vt1l3dhg8beqp31guqq8aaltc06198r91@4ax.com>

On Fri, 30 Nov 2007 18:40:25 -0500, you wrote:

>Will the links be enabled separately day by day?

No. Just like a chocolate calendar you should be able to abuse it.
--
Robert Marquardt (Team JEDI) http://delphi-jedi.org