NAME

HTML::Inspect::Normalize - normalize urls

INHERITANCE

  HTML::Inspect::Normalize
    is an Exporter

SYNOPSIS

  set_page_base($base_url);  # used as base for relative urls
  my $norm = normalize_url($relative_url);
  my ($norm, $rc, $err) = normalize_url($relative_url);

DESCRIPTION

Although this module is part of the HTML::Inspect distribution, it is useful in its own right: its functions very quickly convert sloppy http and https urls, as found on webpages, into cleanly normalized urls.

FUNCTIONS

normalize_url($url)

Normalize a url relative to the page base, which must be set first with set_page_base(). Returns the same values as set_page_base().

set_page_base($base_url)

Set the base url against which relative urls are resolved. The base is normalized before use.

In LIST context, returns the normalized url (string), rc, and errmsg. In SCALAR context, only returns the normalized url, and throws an exception when a problem is found.
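The two calling styles can be sketched as follows. This is a minimal illustration, not taken from the distribution: it assumes that both function names can be imported explicitly, that the url is undefined in LIST context when a problem is found, and it uses example.com as a placeholder site.

```perl
use strict;
use warnings;

# Assumption: both functions can be imported by name via Exporter.
use HTML::Inspect::Normalize qw(set_page_base normalize_url);

# LIST context: a problem is reported through $rc and $err, nothing is thrown.
# Assumption: $base is undef when the base url is rejected.
my ($base, $rc, $err) = set_page_base('http://example.com/docs/index.html');
defined $base or die "cannot use base url\n";

# SCALAR context: only the normalized url is returned; a problem is thrown
# as an exception, so the call is wrapped in eval.
my $norm = eval { normalize_url('../img/logo.png#top') };
if (defined $norm) { print "normalized: $norm\n" }
else               { print "rejected: $@\n" }
```

LIST context suits bulk processing, where a bad link should be counted and skipped; SCALAR context suits code which treats a bad url as a fatal condition.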

DETAILS

See also https://pipeline.shared-search.eu/extract/normalize.html

The following actions are taken:

  • leading and trailing blanks are stripped

  • white-space characters (CR, LF, TAB, VTAB) are removed, together with any blanks which follow them

  • relative urls are converted to absolute

  • '+' and embedded blanks are converted to %20

  • hex encodings of normal characters (which include comma and more) are converted back into the characters themselves

  • characters which need to be encoded are converted to hex

  • hex digits are upper-cased

  • utf8 characters get hex encoded

  • hex encoding must be valid utf8, possibly multi-byte

  • fragment is removed

  • an empty path becomes '/'

  • ./ and ../ are removed

  • repeated slashes are removed

  • hostnames with utf8 get IDN encoded

  • hostname syntax verified

  • remove trailing dot from hostname

  • default port numbers removed

  • leading zeros in port numbers are removed; port numbers are restricted to at most 65535
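A few of the actions listed above can be illustrated in plain Perl. The helper sketch_normalize() below is hypothetical and is not the module's implementation: it only handles absolute, lower-case http and https urls, and covers blank stripping, fragment removal, the trailing dot in the hostname, port clean-up, and path clean-up. Encoding, IDN, and relative urls are left out.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Inspect::Normalize.
my %default_port = (http => 80, https => 443);

sub sketch_normalize {
    my ($url) = @_;

    $url =~ s/^\s+|\s+$//g;    # strip leading and trailing blanks
    $url =~ s/#.*//s;          # remove the fragment

    my ($scheme, $host, $port, $path, $query) =
        $url =~ m{^(https?)://([^/:?]+)(?::(\d*))?(/[^?]*)?(\?.*)?$}s
        or return;             # not an absolute http(s) url

    $host =~ s/\.$//;          # remove trailing dot from hostname
    $path  //= '';
    $query //= '';

    # port: drop leading zeros, enforce the maximum, drop the default
    if (defined $port && length $port) {
        $port =~ s/^0+(?=\d)//;
        return if length $port > 5 || $port > 65535;
        $port = undef if $port == $default_port{$scheme};
    } else {
        $port = undef;
    }

    # path: collapse repeated slashes, then resolve ./ and ../
    $path =~ s{//+}{/}g;
    my @segs = split m{/}, $path, -1;
    shift @segs;               # drop the empty element before the first '/'
    my @out;
    for my $i (0 .. $#segs) {
        my $seg  = $segs[$i];
        my $last = $i == $#segs;
        if    ($seg eq '.')  { push @out, '' if $last }
        elsif ($seg eq '..') { pop @out; push @out, '' if $last }
        else                 { push @out, $seg }
    }
    $path = '/' . join '/', @out;    # an empty path becomes '/'

    return "$scheme://$host" . (defined $port ? ":$port" : '') . $path . $query;
}

print sketch_normalize('  http://example.com.:0080/a/./b/../c//d#top '), "\n";
# prints http://example.com/a/c/d
print sketch_normalize('https://example.com'), "\n";
# prints https://example.com/
```

The sketch returns nothing for urls it cannot handle, such as a port above 65535; the module itself reports problems through rc and errmsg instead.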

SEE ALSO

HTML::Inspect, URI::Fast

This module is part of HTML-Inspect distribution version 1.00, built on December 08, 2021. Website: http://perl.overmeer.net/CPAN/

LICENSE

Copyrights 2021 by [Mark Overmeer <markov@cpan.org>]. For other contributors see ChangeLog.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See http://dev.perl.org/licenses/