From wjones@tc.fluke.com Tue Jan 11 22:26:18 2000
Date: Tue, 11 Jan 2000 16:04:33 -0800
From: Warren Jones
To: htdig3-dev@htdig.org
Subject: [htdig3-dev] Fixes for valid_extensions

I was very happy to find that the "valid_extensions" option has
been added in version 3.1.4 -- something like this is essential
given the rather chaotic nature of the web server that I have to
index. But I found that a couple of changes were necessary to make
valid_extensions work the way I wanted it to.

If "valid_extensions" are defined, I'd like to retrieve URLs
without extensions *if_and_only_if* they represent a directory.
However, I found that all URLs without extensions are rejected
if the URL contains a fully qualified domain name, e.g.:

    http://www.foo.com/bar/

Retriever::IsValidURL() rejects this URL because it thinks the
extension is:

    .com/bar/

The patch for Retriever.cc (included below) fixes this.

To ensure that a URL without an extension will be retrieved only
if it's a directory, I modified URL::normalize() so that a slash
is appended to any URL that doesn't have an extension. This
guarantees that retrieval will fail if the URL is not a directory.
This works for me, but I'm not sure that it's the best solution --
comments would be appreciated.

--
Warren Jones
Fluke Corporation

---------------------------- snip snip ----------------------------
Index: Retriever.cc
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/htdig/Retriever.cc,v
retrieving revision 1.1.1.5
diff -c -r1.1.1.5 Retriever.cc
*** Retriever.cc 1999/12/15 22:06:09 1.1.1.5
--- Retriever.cc 2000/01/11 00:28:29
***************
*** 702,707 ****
--- 702,709 ----
      //
      char *ext = strrchr(url, '.');
      String lowerext;
+     if ( ext && strchr(ext,'/') )   // Ignore a dot if it's not in the
+         ext = NULL;                 // final component of the path.
      if (ext)
      {
          lowerext = ext;
Index: URL.cc
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/htlib/URL.cc,v
retrieving revision 1.1.1.5
diff -c -r1.1.1.5 URL.cc
*** URL.cc 1999/12/15 22:06:35 1.1.1.5
--- URL.cc 2000/01/11 23:09:26
***************
*** 469,474 ****
--- 469,490 ----
  
      removeIndex(_path);
  
+     if ( *config["valid_extensions"] != '\0' )
+     {
+         // If we're only accepting valid extensions, then append
+         // a trailing slash to any URL without an extension.
+         // This insures that the only URL's without extensions
+         // we retrieve will be directories.
+ 
+         char *slash = strrchr( _path, '/' );
+         if ( ! slash || slash[1] != '\0' )
+         {
+             char *dot = strrchr( _path, '.' );
+             if ( dot <= slash )
+                 _path << "/";
+         }
+     }
+ 
      //
      // Convert a hostname to an IP address
      //
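
For readers who want to try the two checks outside of ht://Dig, the following is a minimal standalone sketch of the logic in the patch above. It assumes std::string in place of ht://Dig's String class, and the function names final_extension and slash_if_no_extension are hypothetical; neither exists in ht://Dig. It is meant only to illustrate the behaviour described in the message, not to reproduce the real code paths.

```cpp
#include <cassert>   // convenient for callers that want to assert on results
#include <string>

// Hypothetical helper (not part of ht://Dig): return the extension of the
// final path component, including the leading dot, or "" if there is none.
// Mirrors the Retriever.cc change: a dot followed by a '/' is ignored.
std::string final_extension(const std::string &url)
{
    std::string::size_type dot = url.rfind('.');
    if (dot == std::string::npos)
        return "";
    // A slash after the last dot means the dot belongs to the host name
    // or to an earlier path component, so it is not an extension.
    if (url.find('/', dot) != std::string::npos)
        return "";
    return url.substr(dot);
}

// Hypothetical helper (not part of ht://Dig): append a trailing slash to a
// path whose final component has no extension. Mirrors the URL.cc change,
// which applies only when valid_extensions is set.
std::string slash_if_no_extension(std::string path)
{
    std::string::size_type slash = path.rfind('/');
    bool ends_in_slash =
        (slash != std::string::npos && slash + 1 == path.size());
    if (!ends_in_slash)
    {
        std::string::size_type dot = path.rfind('.');
        // No dot at all, or the last dot sits before the last slash.
        bool no_extension =
            (dot == std::string::npos) ||
            (slash != std::string::npos && dot < slash);
        if (no_extension)
            path += '/';
    }
    return path;
}
```

With this sketch, final_extension("http://www.foo.com/bar/") yields an empty string rather than ".com/bar/", and slash_if_no_extension("/bar") yields "/bar/", while "/bar/page.html" is left unchanged.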