From webmaster@javawoman.com Tue Jan 12 00:05:56 1999
Date: Tue, 12 Jan 1999 06:59:29 +0100
From: Marjolein Katsma
To: htdig@sdsu.edu
Subject: htdig: Comments (2)

The following is a patch to the original algorithm for skipping comments
before doing any further parsing of an HTML file. The original algorithm
fails to recognize (legal) comment declarations that have whitespace after
the (final) comment in the declaration, and ends up not indexing the whole
document if such a declaration is found.

I've made a small correction to my original code so that, when a comment
declaration doesn't seem to have an ending '>', the rest of the document is
simply skipped without preventing the part before it from being indexed.
Otherwise, 'illegal' comment declarations are skipped up to a final '>'.

This patch is a *replacement* for my original one (which was mailed but
somehow never made it to the list). It is relative to HTML.cc from the
original 3.1.0b4 release:

diff -3p HTML.cc.orig HTMLcommentMK.cc
*** HTML.cc.orig	Tue Dec 22 18:53:12 1998
--- HTMLcommentMK.cc	Mon Jan 11 22:46:49 1999
***************
*** 3,8 ****
--- 3,15 ----
  //
  // Implementation of HTML
  //
+ // Revision 1999-01-07/1999-01-09 mkatsma
+ // Modification of comment-filtering algorithm so it skips all legal SGML
+ // comment declarations, including ones with whitespace after the last
+ // comment in the declaration. Illegal comment declarations are skipped
+ // till the next '>' without preventing the rest of the document being
+ // indexed.
+ //
  // $Log: HTML.cc,v $
  // Revision 1.23  1998/12/12 01:48:52  ghutchis
  // Fix coredump when META refresh tags don't have content portions (e.g. no URL).
*************** HTML::parse(Retriever &retriever, URL &b
*** 181,198 ****
      while (*position)
      {
!         if (strncmp((char *)position, "<!--", 4) == 0)
!         {
!             q = (unsigned char*)strstr((char *)position, "-->");
!             if (!q)
!                 return; // Rest of document is a comment...
!             position = q + 3;
              continue;
          }

          if (*position == '<')
          {
              //
--- 188,251 ----
      while (*position)
      {
!
!         // Improved algorithm 1999-01-07 Marjolein Katsma
!         // (with help from Gilles Detillieux)
!         // Small fix 1999-01-09
!         if (strncmp((char *)position, "<!", 2) == 0)
!         {
!             // [...] '>':
!             // we have to ignore a complete comment declarations
!             // but of course also DTD declarations.
              //
!             position += 2;  // Get past declaration start
!             while (*position)
!             {
!                 // Let's see if the declaration ends here
!                 if (*position == '>')
!                 {
!                     position++;
!                     break;  // End of comment declaration
!                 }
!                 // Not the end of the declaration yet:
!                 // we'll see if it is an actual comment (should start right here)
!                 if (strncmp((char *)position, "--", 2) == 0)
!                 {   // Start of comment - now find the end
!                     position += 2;
!                     q = (unsigned char*)strstr((char *)position, "--");
!                     if (!q)
!                     {
!                         *position = '\0';// Rest of document (comment?) will be skipped
!                         break;
!                     }
!                     position = q + 2;
!                 }
!                 else
!                 {   // Not a (legal) comment declaration after all;
!                     // could be illegal comment or DTD:
!                     // get to the end
!                     q = (unsigned char*)strstr((char *)position, ">");
!                     if (q)
!                     {
!                         position = q + 1;
!                         break;  // End of (whatever) declaration
!                     }
!                     else
!                     {
!                         *position = '\0';// Rest of document (DTD?) will be skipped
!                         break;
!                     }
!                 }
!                 // Skip whitespace after an individual comment
!                 while (isspace(*position))
!                     position++;
!             }
              continue;
          }
+
+
          if (*position == '<')
          {
              //

Marjolein Katsma    webmaster@javawoman.com
Java Woman - http://javawoman.com/
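
The declaration-skipping logic in the patch can be tried outside ht://Dig with a small standalone sketch. This is only an illustration under stated assumptions: the function name skip_declaration is hypothetical and does not exist in HTML.cc, it works on a plain const char * rather than the parser's unsigned char * buffer, and instead of writing '\0' into the buffer it returns a pointer to the end of the string when a declaration is never closed.

```cpp
#include <cctype>
#include <cstring>

// Hypothetical helper, not part of ht://Dig.
// 'p' must point at the "<!" that opens a declaration. Returns a pointer
// just past the declaration, or to the terminating '\0' if the declaration
// (or a comment inside it) is never closed.
const char *skip_declaration(const char *p)
{
    p += 2;                                 // get past "<!"
    while (*p)
    {
        if (*p == '>')
            return p + 1;                   // end of the declaration

        if (std::strncmp(p, "--", 2) == 0)
        {
            // Start of an individual comment: find its closing "--"
            p += 2;
            const char *q = std::strstr(p, "--");
            if (!q)
                return p + std::strlen(p);  // unterminated: skip the rest
            p = q + 2;
        }
        else
        {
            // Not a legal comment (illegal comment or DTD declaration):
            // skip to the next '>' if there is one
            const char *q = std::strchr(p, '>');
            return q ? q + 1 : p + std::strlen(p);
        }

        // Whitespace is allowed after each individual comment
        while (std::isspace(static_cast<unsigned char>(*p)))
            ++p;
    }
    return p;
}
```

Called on "<!-- a --   >rest", the sketch returns a pointer to "rest", which is the whitespace-before-'>' case that the original algorithm missed.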