From marc.pohl@wdr.de Thu Jan 13 13:46:24 2000
Date: Thu, 13 Jan 2000 20:41:04 +0100
From: Marc Pohl
To: Geoff Hutchison
Cc: htdig3-dev@htdig.org
Subject: Re: [htdig3-dev] htdig 3.1.4 is not 8-bit-clean on solaris

At 17:47 12.01.00 -0600, you wrote:
>At 6:49 PM +0100 1/12/00, Marc Pohl wrote:
>>I reviewed the source code for htdig-3.2.0b1-dev-010900 this weekend
>>and discovered that there could be similar errors in
>>htword/WordType.cc because of signed char to int casts. Exactly the
>>same error cannot happen because the iscntrl() is in the else branch
>>of IsStrictChar() in 3.2.
>
>Could you also post your original patch to 3.1.4 with diff -c as
>well? I'd like to have it on the htdig@htdig.org lists because I
>think it will help some of these recent questions about indexing and
>searching foreign characters.
>
>>My proposed patch is the following snippet, introducing two new
>>member functions to WordType, instead of calling isdigit() and
>>iscntrl() directly.
>
>This looks fine to me. Since it's a bug-fix, unless I hear screams of
>protest, it's going in sometime tomorrow.
>
>-Geoff
>

Hello Geoff,

Yesterday I found a small potential problem in the patched code:

At the beginning of the initialisation of WordType is the line

    chrtypes[0] = 0;

Because we never call iscntrl(0), this line must be

    chrtypes[0] = WORD_TYPE_CONTROL;

During my tests this makes no difference, but I think that is because
I don't have any unwanted #0 characters in my HTML docs.

Marc

And here is my patch against version 3.1.4:

*** WordList.cc.orig Fri Dec 10 01:28:44 1999
--- WordList.cc Thu Jan 13 20:23:29 2000
***************
*** 108,125 ****
      while (word && *word)
      {
!         if (HtIsStrictWordChar((unsigned char)*word) && !isdigit(*word))
          {
              alpha = 1;
              // break; /* Can't stop here, there may still be control chars! */
          }
!         else if (allow_numbers && isdigit(*word))
          {
              alpha = 1;
              // break; /* Can't stop here, there may still be control chars! */
          }
          // if (*word >= 0 && *word < ' ')
!         else if (iscntrl(*word))
          {
              control = 1;
              break;
--- 108,125 ----
      while (word && *word)
      {
!         if (HtIsStrictWordChar((unsigned char)*word) && !isdigit((unsigned char)*word))
          {
              alpha = 1;
              // break; /* Can't stop here, there may still be control chars! */
          }
!         else if (allow_numbers && isdigit((unsigned char)*word))
          {
              alpha = 1;
              // break; /* Can't stop here, there may still be control chars! */
          }
          // if (*word >= 0 && *word < ' ')
!         else if (iscntrl((unsigned char)*word))
          {
              control = 1;
              break;

I hope that my email program will not mangle that ;-)

-----------------------------------------------------
Marc Pohl
Westdeutscher Rundfunk         Tel.: +49 221 220 8618
OSC/Videotextredaktion         FAX:  +49 221 220 3882
D-50600 Koeln                  Email: marc.pohl@wdr.de
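
A note on why the casts in the patch matter: isdigit() and iscntrl() take an int whose value must be EOF or representable as unsigned char. Where plain char is signed, a byte such as 0xF6 arrives as a negative int, and the behaviour is undefined. The sketch below is only an illustration of the two points made in this message (cast before classifying, and fill entry 0 by hand when the loop starts at 1). The helper names, the flag values and the table layout are assumptions for the example, not the actual WordType code; only the name WORD_TYPE_CONTROL comes from the message.

```cpp
#include <cctype>

// Hypothetical flag values; the message only names WORD_TYPE_CONTROL,
// its real value is not shown there.
enum {
    SKETCH_TYPE_DIGIT   = 0x01,
    SKETCH_TYPE_CONTROL = 0x02
};

// Hypothetical helpers: convert to unsigned char before calling the
// <cctype> classifiers, so bytes 0x80..0xFF never reach them as
// negative ints.
static bool is_digit_8bit(char c)
{
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

static bool is_cntrl_8bit(char c)
{
    return std::iscntrl(static_cast<unsigned char>(c)) != 0;
}

// Fill a 256-entry lookup table indexed by the unsigned byte value.
// Entry 0 is set by hand because the loop below starts at 1.
static void build_char_types(unsigned char types[256])
{
    types[0] = SKETCH_TYPE_CONTROL;
    for (int i = 1; i < 256; ++i) {
        const char c = static_cast<char>(i);
        types[i] = 0;
        if (is_digit_8bit(c)) types[i] |= SKETCH_TYPE_DIGIT;
        if (is_cntrl_8bit(c)) types[i] |= SKETCH_TYPE_CONTROL;
    }
}
```

When looking up a character read through a char pointer, the index needs the same cast, for example types[(unsigned char)*word], otherwise a signed char would index before the start of the table.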