From grdetil@scrc.umanitoba.ca Tue Mar 20 13:51:56 2001
Date: Tue, 20 Mar 2001 15:18:32 -0600 (CST)
From: Gilles Detillieux
To: Malcolm Austen
Cc: Gilles Detillieux, htdig-general@lists.sourceforge.net
Subject: Re: [htdig] Errors in reported hopcounts - PATCH

According to Malcolm Austen:
> On Tue, 13 Mar 2001, Malcolm Austen wrote:
> + On Mon, 12 Mar 2001, Gilles Detillieux wrote:
> + + Well, if you have a simple test set of data that produces this problem
> + + in 3.1.5, then please do bore us with it. Even though there have been
> + + substantial changes for 3.2, much of those have been backported to 3.1.5,
> + + so if the problem remains in 3.1.x but not in 3.2.x, I'd like to know what
> + + the cause is, so we can address this if/when we start working on 3.1.6.
>
> Gilles, (and anyone else who cares to try to resolve this!!)
>
> I got worried yesterday afternoon that I was not going to be able to
> reproduce the fault without indexing 20,000 documents. Fortunately I did
> manage it with just one server and (just)under 600 documents.
>
> I have indexed (config file at the end of this message) with a hop count
> of one and then again with a hopcount of two. The result of the second run
> is some 299 documents with hopcounts of 1 that were not indexed in the
> first run.
...

OK, after looking at the files on your site, I was able to reproduce the
problem on my site. (Actually the problem was there all along, I just
didn't know what to look for because hop count isn't an issue for us.)
It turns out that htdig does a depth-order traversal of the document
tree, so really the hop count should always be increasing, never
decreasing. Hunting around in the code, I was able to find out why it
was decreasing, and looking back in earlier versions, I found out when
it broke. It was working in versions 3.0.8b2 and 3.1.0b1, but broken in
3.1.0b2.
In 3.1.0b3, Geoff tried to fix it, but IMHO ended up breaking it even
more, with this patch:

    http://www.htdig.org/mail/1998/11/0345.html

The problem was that in preparation for 3.1.0b2, Geoff made a number of
other hopcount-related changes. These were all good, as far as I can
tell, except for one: inexplicably, he reversed the comparison from:

    if (ref->DocHopCount() > currenthopcount + 1)

to:

    if (ref->DocHopCount() < currenthopcount + 1)

causing htdig to take the higher hop count rather than the lower one!
This led to the problem which led to the patch referenced above.

My fix is to go back to the way 3.0.8b2 did it, but without losing all
the other good fixes in 3.1.0b2. This fix should be applied to both the
3.1 and 3.2 series.

Geoff, can you verify this change? Your patch of 1998/11 doesn't make
sense to me, but maybe I'm not grasping what your intention really was.
Why should an href in the current document cause the current document's
hop count to drop? It seems any change should be to the referenced
document, not to the current one. Can you let me know if my patch breaks
anything?

Malcolm, and anyone else who experienced hop count problems in 3.1.5,
can you please test this patch and let me know if it fixes your problems
and/or causes new ones? Use "patch -p0 < this-message-file" in your
htdig-3.1.5 source directory...

--- htdig/Retriever.cc.orig	Thu Feb 24 20:29:10 2000
+++ htdig/Retriever.cc	Tue Mar 20 14:45:24 2001
@@ -1211,11 +1211,8 @@ Retriever::got_href(URL &url, char *desc
         return;
     }
 
-    if (ref->DocHopCount() != -1 &&
-        ref->DocHopCount() < currenthopcount + 1)
-      // If we had taken the path through this ref
-      // We'd be here faster than currenthopcount
-      currenthopcount = ref->DocHopCount();  // So update it!
+    if (ref->DocHopCount() > currenthopcount + 1)
+        ref->DocHopCount(currenthopcount + 1);
 
     docs.Add(*ref);
 

-- 
Gilles R. Detillieux              E-mail: grdetil@scrc.umanitoba.ca
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
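
As a rough illustration of why the direction of the comparison matters,
here is a minimal standalone C++ sketch. It is not htdig code: DocRef,
note_link and hop_counts are hypothetical names, and the -1 "not yet
seen" marker and the breadth-first queue are assumptions made for the
example. The only part that mirrors the patch is the ">" comparison,
which keeps the lower hop count when a document can be reached by more
than one path.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical stand-in for a document reference.
// A hop count of -1 means "not seen yet" (an assumption of this sketch).
struct DocRef {
    int hopcount = -1;
};

// Record a link found in a document whose hop count is currenthopcount.
// The stored hop count is only ever lowered, never raised.
// Returns true if the target was seen for the first time.
bool note_link(DocRef &ref, int currenthopcount) {
    const int candidate = currenthopcount + 1;
    if (ref.hopcount == -1) {
        ref.hopcount = candidate;
        return true;
    }
    if (ref.hopcount > candidate)   // same direction as the patched test
        ref.hopcount = candidate;
    return false;
}

// Walk an adjacency list level by level from start and return the hop
// count of every document (-1 for documents that were never reached).
std::vector<int> hop_counts(const std::vector<std::vector<std::size_t>> &links,
                            std::size_t start) {
    assert(start < links.size());
    std::vector<DocRef> refs(links.size());
    refs[start].hopcount = 0;

    std::queue<std::size_t> pending;
    pending.push(start);
    while (!pending.empty()) {
        const std::size_t doc = pending.front();
        pending.pop();
        for (std::size_t target : links[doc]) {
            if (note_link(refs[target], refs[doc].hopcount))
                pending.push(target);   // queue each document only once
        }
    }

    std::vector<int> result;
    for (const DocRef &r : refs)
        result.push_back(r.hopcount);
    return result;
}
```

For a four-page site where page 0 links to 1 and 2, page 1 links to 2
and 3, page 2 links to 3, and page 3 links back to 0, hop_counts(links, 0)
gives 0, 1, 1, 2. If the comparison in note_link were reversed to "<",
page 2 would be raised from 1 to 2 when it is seen again from page 1.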