NAME

WWW::LinkRot - check web page link rot

SYNOPSIS

    use WWW::LinkRot;

VERSION

This documents version 0.02 of WWW-LinkRot corresponding to git commit e07a0ffb766775fc053e9820edf1f874ee40b78c released on Fri Apr 23 08:30:11 2021 +0900.

DESCRIPTION

Scan HTML files for links, try to access the links, and make a report.

The HTML files need to be in UTF-8 encoding.

This module is intended for web site maintainers who want to run, for example, periodic checks over a large number of HTML files, find all of the external links in those files, and then test each link to make sure that it is still valid.

The reading function is "get_links", which works on a list of file names such as might be created by a module like Trav::Dir or File::Find. It looks for any https?:// links in the files and makes a list.

The list of links may then be checked for validity using "check_links", which runs the get method of "LWP::UserAgent" on each link and stores the status. This outputs a JSON file containing each link, its status, its location (for redirects), and the files which contain it.

The function "html_report" generates an HTML representation of the JSON file.

The function "replace" is a batch editing function which inputs a list of links and a list of files, then substitutes the redirected links (the ones with status 301 or 302) with their replacement.

FUNCTIONS

check_links

    check_links ($links);

Check the links returned by "get_links" and write to a JSON file specified by the out option.

    check_links ($links, out => "link-statuses.json");

Usually one would filter the links returned by "get_links" to remove things like internal links.
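For instance, one might drop the site's own links from the hash before checking. This is a sketch only; the domain is hypothetical, and the hash structure is as described under "get_links" below.

    # Drop internal links before checking (the domain is hypothetical).
    for my $link (keys %$links) {
        delete $links->{$link} if $link =~ m!^https?://(?:www\.)?example\.com!;
    }
    check_links ($links, out => 'link-statuses.json');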

Options

nook

If this is set to a true value, check_links first reads in a previous copy of the file specified by the out option. If a link's status in that file is 200, it doesn't try to access the link again but assumes it is still OK.

This option is useful when one has recently run the checks, done work on fixing the dead or moved links, and then wants to check whether the errors were fixed, without re-checking all of the pages which were already OK.

out

Specify the JSON file to write. check_links fails if this option is not given.

verbose

Print messages about what is to be done. Since checking the links might take a long time, this is sometimes reassuring.
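The options may be combined, as in this sketch:

    check_links ($links,
        out => 'link-statuses.json',   # required output file
        nook => 1,                     # trust links which were 200 last time
        verbose => 1,                  # print progress messages
    );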

The user agent

The user agent used by WWW::LinkRot is "LWP::UserAgent" with the timeout option set to 5 seconds and the maximum number of redirects set to zero. If a timeout is not used, check_links may take a very long time to run. However, some links, such as archive.org links, may take more than five seconds to respond.

The user-agent string sent to the server is WWW::LinkRot.
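For reference, an equivalent user agent could be constructed as follows. This is an illustration of the configuration described above, not the module's actual internals.

    use LWP::UserAgent;

    # Illustrative equivalent of the settings described above.
    my $ua = LWP::UserAgent->new (
        timeout => 5,        # give up on slow servers after five seconds
        max_redirect => 0,   # don't follow redirects, so 301/302 are reported
        agent => 'WWW::LinkRot',
    );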

get_links

    my $links = get_links (\@files);

Given a list of HTML files in @files, extract all the links from them. The return value $links is a hash reference whose keys are the links and whose values are array references listing the files of @files which contain each link.

This looks for anything of the form href="*" in the files and adds what is between the quotes to the list of links.
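The returned structure has this shape; the links and file names here are hypothetical:

    my $links = {
        'https://www.example.com/page' => ['index.html', 'about.html'],
        'http://www.archive.org/something' => ['index.html'],
    };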

html_report

    html_report (in => 'link-statuses.json', out => 'report.html');

Write an HTML report using the JSON output by "check_links". The report consists of header HTML generated by "HTML::Make::Page", followed by a table with one row per link giving the link, its status, and the pages where it is used.

Options

in

The input JSON file.

nofiles

If set to a true value, don't add the final "files" column. For example, this may be used when checking only a single file for dead links.

out

The output HTML file.

strip

The part of the file name which needs to be stripped from each file name to make a URL, like "/home/users/jason/website".

url

The part of the URL which needs to be prepended to the file names to make a URL, like "https://www.example.com/site".
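Putting the options together, a report for a locally stored site might be generated like this, reusing the example values above:

    html_report (
        in => 'link-statuses.json',
        out => 'report.html',
        strip => '/home/users/jason/website',
        url => 'https://www.example.com/site',
    );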

The output HTML file

Moved links are coloured pink, and dead links are coloured yellow.

Links are cut down to a maximum length of 100 characters.

replace

    replace (\%links, \@files, %options);

Make a regex of the links which have a redirect status (301 or 302) and a valid location, then go through @files and replace those links with their new locations; see the sketch after the options.

Options

verbose

Print messages about the links and the files being edited.
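A full edit run might look like the following sketch. It assumes that the JSON report written by "check_links" can be read back with JSON::Parse and passed directly as the first argument; the directory is hypothetical.

    use strict;
    use warnings;
    use File::Find;
    use JSON::Parse 'read_json';
    use WWW::LinkRot 'replace';

    # Read back the report written by check_links. The assumption is that
    # its structure is what replace expects as \%links.
    my $links = read_json ('link-statuses.json');

    # Collect the files to edit (the directory is hypothetical).
    my @files;
    find (sub { push @files, $File::Find::name if /\.html$/ },
          '/home/users/jason/website');

    # Rewrite 301/302 links in place, reporting each edit.
    replace ($links, \@files, verbose => 1);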

DEPENDENCIES

Convert::Moji

This is used to make the regex used by "replace".

File::Slurper

This is used for reading and writing files.

HTML::Make

This is used to make the table in the HTML report about the links.

HTML::Make::Page

This is used to make the header of the HTML report page.

JSON::Create

This is used to write the JSON report file about the links.

JSON::Parse

This is used to read back the JSON report.

LWP::UserAgent

This is used to check the links.

SEE ALSO

CPAN

HTML::LinkExtor
HTTP::SimpleLinkChecker
WebFetch
W3C::LinkChecker
WWW::LinkChecker::Internal

Other

We used this more than ten years ago and it seemed to work very well. It hasn't been updated in ten years though.

A web site which checks the links on your website.

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENCE

This package and associated files are copyright (C) 2021 Ben Bullock.

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.