
NAME

Unicode::Casing - Perl extension to override system case changing functions

SYNOPSIS

  use Unicode::Casing
            uc => \&my_uc, lc => \&my_lc,
            ucfirst => \&my_ucfirst, lcfirst => \&my_lcfirst;
  no Unicode::Casing;

DESCRIPTION

This module overrides the system-defined character case-changing functions. Whenever code in its lexical scope would ordinarily call lc(), lcfirst(), uc(), or ucfirst(), the corresponding user-specified function is called instead. This applies both to direct calls and to indirect calls made through the \L, \l, \U, and \u escapes in double-quoted strings and regular expressions.

Each function is passed a string to change the case of, and should return the case-changed version of that string. Using, for example, \U inside the override function for uc() will lead to infinite recursion, but the standard casing functions are available via CORE::. For example,

 sub my_uc {
    my $string = shift;
    print "Debugging information\n";
    return CORE::uc($string);
 }
 use Unicode::Casing uc => \&my_uc;
 uc($foo);

gives the standard upper-casing behavior, but prints "Debugging information" first.

It is an error not to specify at least one override in the "use" statement. Functions without an override keep their standard behavior. It is also an error to specify more than one override for the same function.

The inline case-changing sequences work in regular expressions without use re 'eval'.
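The lexical scoping and the inline escapes can be illustrated with a minimal sketch. The logging_uc function and the @seen and @results arrays are hypothetical names introduced here for illustration, and the sketch assumes Unicode::Casing is installed.

```perl
use strict;
use warnings;

my @seen;    # hypothetical log of every string handed to the override

# Hypothetical override: records its argument, then defers to the
# standard upper-casing function through CORE::
sub logging_uc {
    my $string = shift;
    push @seen, $string;
    return CORE::uc($string);
}

my $word = 'perl';
my @results;

{
    use Unicode::Casing uc => \&logging_uc;

    push @results, uc($word);                              # direct call
    push @results, "\U$word\E";                            # escape in a double-quoted string
    push @results, ('PERL' =~ /\A\U$word\E\z/ ? 1 : 0);    # escape in a regular expression
}

# Outside the enclosing block the override is no longer in effect
push @results, uc($word);

printf "override called %d time(s)\n", scalar @seen;
```

If the module behaves as described above, the override should run for the three uses inside the block and not for the call that follows it.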

Here's an example of a real-life application, for Turkish, that shows context-sensitive case-changing.

 sub turkish_lc($) {
    my $string = shift;

    # Unless an I is before a dot_above, it turns into a dotless i (the
    # dot above being attached to the I, without an intervening other
    # Above mark; an intervening non-mark (ccc=0) would mean that the
    # dot above would be attached to that character and not the I)
    $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;

    # But when the I is followed by a dot_above, remove the dot_above so
    # the end result will be i.
    $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;

    # LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130) becomes a plain i.
    $string =~ s/\x{130}/i/g;

    return CORE::lc($string);
 }

A potential problem with context-dependent case changing is that the routine may be passed insufficient context, especially with inline escapes like \L, which hand the override only the text they enclose.
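The following sketch illustrates the problem with the turkish_lc routine shown above, called directly so that it runs without the module. The variables $whole, $head, and $tail and the code_points helper are illustrative names, and splitting the text between $head and $tail is an assumed stand-in for what an interpolation such as "\L$head\E$tail" would hand to the override.

```perl
use strict;
use warnings;

# Copy of the turkish_lc routine from the example above
sub turkish_lc($) {
    my $string = shift;
    $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;
    $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;
    $string =~ s/\x{130}/i/g;
    return CORE::lc($string);
}

# Hypothetical helper: render a string as a list of code points
sub code_points {
    my $string = shift;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

my $whole = "I\x{307}";              # I followed by COMBINING DOT ABOVE
my ($head, $tail) = ("I", "\x{307}");

# The routine sees the I together with its dot above
my $together = turkish_lc($whole);

# The routine sees only the I; the dot above is appended afterwards
my $apart = turkish_lc($head) . $tail;

print code_points($together), "\n";  # U+0069
print code_points($apart), "\n";     # U+0131 U+0307
```

With the full context the result is a plain i, but when the dot above arrives outside the text being lower-cased, the routine treats the I as undotted and produces a dotless i followed by a stray combining dot.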

turkish.t, which comes with the distribution, includes a full implementation of all the Turkish casing rules.

AUTHOR

Karl Williamson, <khw@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2011 by Karl Williamson

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.1 or, at your option, any later version of Perl 5 you may have available.