From grdetil@scrc.umanitoba.ca Wed Aug 25 14:53:53 1999
Date: Wed, 25 Aug 1999 16:08:55 -0500 (CDT)
From: Gilles Detillieux
To: htdig@htdig.org
Subject: [htdig] patch for improved compound word handling in htdig

Finally, after much anticipation and fanfare (OK, maybe not :), here is
my compound word patch.  I'd appreciate any feedback from those who try
it out.  I'd recommend applying my previous patches before applying this
one, especially the excerpt highlighting patch I posted yesterday, which
is sort of a companion to this one.  Of course, this patch won't have any
effect until you re-index.

This patch improves htdig's handling of compound words, like
post-doctoral and such, to add each individual part, as well as the
whole, into the word database.  This allows searches for individual
parts, like "doctoral", to find those parts in hyphenated (or otherwise
punctuated) compound words.  It should also fix the problem with "d'" in
French text.  The code seems quite convoluted because it's designed to
handle all the combinations of parts in multi-hyphen-compound-words.

--- htdig-3.1.2.bak/htdig/Retriever.cc	Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/Retriever.cc	Wed Aug 25 15:36:12 1999
@@ -879,6 +879,56 @@ Retriever::got_word(char *word, int loca
         HtStripPunctuation(w);
         if (w.length() >= minimumWordLength)
             words.Word(w, location, current_anchor_number, factor[heading]);
+        if (strcmp(word, w.get()) != 0) // have punctuation that was stripped
+        {
+            // Check for compound words...
+            String parts = word;
+            int added;
+            int nparts = 1;
+            do
+            {
+                added = 0;
+                char *start = parts.get();
+                char *punctp, *nextp, *p;
+                char punct;
+                int n;
+                while (*start)
+                {
+                    p = start;
+                    for (n = 0; n < nparts; n++)
+                    {
+                        while (HtIsStrictWordChar((unsigned char)*p))
+                            p++;
+                        punctp = p;
+                        if (!*punctp && n+1 < nparts)
+                            break;
+                        while (*p && !HtIsStrictWordChar((unsigned char)*p))
+                            p++;
+                        if (n == 0)
+                            nextp = p;
+                    }
+                    if (n < nparts)
+                        break;
+                    punct = *punctp;
+                    *punctp = '\0';
+                    if (*start && (*p || start > parts.get()))
+                    {
+                        w = start;
+                        HtStripPunctuation(w);
+                        if (w.length() >= minimumWordLength)
+                        {
+                            words.Word(w, location, current_anchor_number, factor[heading]);
+                            if (debug > 3)
+                                cout << "word part: " << start << '@' << location << endl;
+                        }
+                        added++;
+                    }
+                    start = nextp;
+                    *punctp = punct;
+                }
+                nparts++;
+            } while (added > 2);
+        }
     }
 }
 
--- htdig-3.1.2.bak/htdig/PDF.cc	Wed Aug 18 16:40:30 1999
+++ htdig-3.1.2/htdig/PDF.cc	Wed Aug 25 15:41:01 1999
@@ -525,16 +525,11 @@ void PDF::parseString()
 
             if (word.length() >= minimumWordLength)
             {
-                word.lowercase();
-                HtStripPunctuation(word);
-                if (word.length() >= minimumWordLength)
-                {
-                    _retriever->got_word(word,
-                                         int(_curPage * 1000 / _pages),
-                                         0);
-                    if (debug > 3)
-                        printf("PDF::parseString: got word %s\n", word.get());
-                }
+                _retriever->got_word(word,
+                                     int(_curPage * 1000 / _pages),
+                                     0);
+                if (debug > 3)
+                    printf("PDF::parseString: got word %s\n", word.get());
             }
         }
 
--- htdig-3.1.2.bak/htdig/Plaintext.cc	Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/Plaintext.cc	Wed Aug 25 15:40:13 1999
@@ -72,14 +72,9 @@ Plaintext::parse(Retriever &retriever, U
 
             if (word.length() >= minimumWordLength)
             {
-                word.lowercase();
-                HtStripPunctuation(word);
-                if (word.length() >= minimumWordLength)
-                {
-                    retriever.got_word(word,
-                                       int(offset * 1000 / contents->length()),
-                                       0);
-                }
+                retriever.got_word(word,
+                                   int(offset * 1000 / contents->length()),
+                                   0);
             }
         }
 
-- 
Gilles R. Detillieux              E-mail:
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
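
For anyone trying to follow what the Retriever.cc hunk is meant to add to
the word database, here is a minimal standalone sketch of the same idea.
It is not the patch code: it uses std::string rather than htdig's String
class, the helper names (is_word_char, split_components, compound_parts)
are made up for illustration, is_word_char() is only an assumed stand-in
for HtIsStrictWordChar(), and the min_len default of 3 is an assumed value
for minimumWordLength.  It returns every contiguous run of 1 to k-1
components of a k-part compound, with the punctuation removed, and leaves
out the whole word because got_word() already adds that separately.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Assumed stand-in for htdig's HtIsStrictWordChar(): alphanumerics only.
static bool is_word_char(unsigned char c)
{
    return std::isalnum(c) != 0;
}

// Split a token into its runs of word characters, dropping punctuation.
static std::vector<std::string> split_components(const std::string &token)
{
    std::vector<std::string> comps;
    std::string cur;
    for (unsigned char c : token)
    {
        if (is_word_char(c))
            cur += static_cast<char>(c);
        else if (!cur.empty())
        {
            comps.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty())
        comps.push_back(cur);
    return comps;
}

// Return every contiguous run of n components, for n = 1 .. k-1, joined
// with the punctuation stripped.  Runs shorter than min_len are skipped.
// The full k-component word is deliberately not returned.
static std::vector<std::string> compound_parts(const std::string &token,
                                               std::size_t min_len = 3)
{
    const std::vector<std::string> comps = split_components(token);
    std::vector<std::string> out;
    for (std::size_t n = 1; n < comps.size(); ++n)
    {
        for (std::size_t i = 0; i + n <= comps.size(); ++i)
        {
            std::string part;
            for (std::size_t j = i; j < i + n; ++j)
                part += comps[j];
            if (part.size() >= min_len)
                out.push_back(part);
        }
    }
    return out;
}
```

With these assumptions, compound_parts("post-doctoral") gives "post" and
"doctoral", and compound_parts("d'accord") gives only "accord", since "d"
falls below the minimum length.  A four-part token such as
"multi-hyphen-compound-words" gives the four single parts, three two-part
runs and two three-part runs.  The patch gets the same stopping point a
different way: it works in place on a char buffer and ends the outer loop
once a pass adds no more than two parts.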