Name

SPVM::Document::Language::Tokenization - Tokenization in the SPVM Language

Description

This document describes the tokenization in the SPVM language.

Tokenization

This section describes lexical analysis in the SPVM language, which is called tokenization.

See SPVM::Document::Language::SyntaxParsing for syntax parsing.

Character Encoding

The character encoding of SPVM source code is UTF-8.

If a character is an ASCII character, it must be an ASCII printable character or a space character.

Compilation Errors:

The character encoding of SPVM source code must be UTF-8. Otherwise a compilation error occurs.

If a character is an ASCII character, it must be an ASCII printable character or a space character. Otherwise a compilation error occurs.
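
The two rules above can be checked before any token is read. The following Python sketch is illustrative only and is not the SPVM compiler's implementation. `check_source_encoding` and `SPACE_BYTES` are hypothetical names, and the sketch assumes that "space character" means the four characters listed in the Space Characters section.

```python
# Space characters as listed in the "Space Characters" section: SP, HT, FF, LF.
# Treating exactly this set as "space characters" is an assumption of this sketch.
SPACE_BYTES = {0x20, 0x09, 0x0C, 0x0A}


def check_source_encoding(source: bytes) -> None:
    """Raise ValueError if the bytes break either character encoding rule."""
    try:
        source.decode("utf-8")
    except UnicodeDecodeError as error:
        raise ValueError("the source code must be UTF-8") from error

    for offset, byte in enumerate(source):
        if byte < 0x80:  # ASCII range
            printable = 0x20 <= byte <= 0x7E
            if not (printable or byte in SPACE_BYTES):
                raise ValueError(
                    f"invalid ASCII character 0x{byte:02x} at offset {offset}"
                )
```

Bytes of multi-byte UTF-8 characters are all 0x80 or greater, so the second loop only ever inspects ASCII characters.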

Line Terminators

The line terminator is ASCII LF.

When a line terminator appears, the current line number is incremented by 1.

Space Characters

The space characters are ASCII SP, HT, FF, LF.

Word Characters

The word characters are ASCII a-zA-Z, 0-9, _.

Names

This section describes names.

Symbol Name

A symbol name consists of word characters and ::.

It does not contain __.

It does not begin with 0-9.

It does not begin with ::.

It does not end with ::.

It does not contain ::::.

Compilation Errors:

If a symbol name is invalid, a compilation error occurs.

Examples:

  # Symbol names
  foo
  foo_bar2
  Foo::Bar
  
  # Invalid symbol names
  2foo
  foo__bar
  ::Foo
  Foo::
  Foo::::Bar
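
The rules for symbol names can be expressed as a small validator. This is a Python sketch under stated assumptions, not the tokenizer's actual code. `is_symbol_name` is a hypothetical name, the empty string is assumed not to be a symbol name, and only the first character of the whole name is checked against 0-9, as the rules above state.

```python
import re

# One or more word characters, then zero or more groups of "::" plus word characters.
# This shape already excludes a leading "::", a trailing "::", "::::", and a lone ":".
_SYMBOL_SHAPE = re.compile(r"[A-Za-z0-9_]+(?:::[A-Za-z0-9_]+)*")


def is_symbol_name(name: str) -> bool:
    """Return True if name satisfies the symbol name rules."""
    if _SYMBOL_SHAPE.fullmatch(name) is None:
        return False
    if "__" in name:
        return False
    if name[0] in "0123456789":
        return False
    return True
```

With this sketch, foo, foo_bar2, and Foo::Bar are accepted, and 2foo, foo__bar, ::Foo, Foo::, and Foo::::Bar are rejected.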

Class Name

A class name is a symbol name.

Each partial name of a class name must begin with an uppercase letter.

Partial names are individual names separated by ::. For example, the partial names of Foo::Bar::Baz are Foo, Bar, and Baz.

Compilation Errors:

If a class name is invalid, a compilation error occurs.

Examples:

  # Class names
  Foo
  Foo::Bar
  Foo::Bar::Baz3
  Foo_Bar::Baz_Baz
  
  # Invalid class names
  Foo::::Bar
  Foo::Bar::
  Foo__Bar
  Foo::bar
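
The partial name rule can be sketched by splitting the name on :: and checking each piece. This Python code is illustrative only. `is_class_name` is a hypothetical name, and "uppercase letter" is assumed to mean ASCII A-Z.

```python
import re

# A partial name: an uppercase ASCII letter followed by word characters.
_PARTIAL_NAME = re.compile(r"[A-Z][A-Za-z0-9_]*")


def is_class_name(name: str) -> bool:
    """Return True if every partial name begins with an uppercase letter."""
    if "__" in name:
        return False  # the symbol name rules forbid "__"
    # Splitting "Foo::::Bar" or "Foo::Bar::" yields an empty partial name,
    # which fails the match below.
    partial_names = name.split("::")
    return all(_PARTIAL_NAME.fullmatch(part) is not None for part in partial_names)
```

Under the rule stated above, Foo::Bar::Baz3 passes because Foo, Bar, and Baz3 each begin with an uppercase letter, while Foo::bar fails on its second partial name.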

Method Name

A method name is a symbol name without :: or an empty string "".

Method names with the same name as keywords are allowed.

Compilation Errors:

If a method name is invalid, a compilation error occurs.

Examples:

  # Method names
  FOO
  FOO_BAR3
  foo
  foo_bar
  _foo
  _foo_bar_
  
  # Invalid method names
  foo__bar
  3foo

Field Name

A field name is a symbol name without ::.

Field names with the same name as keywords are allowed.

Compilation Errors:

If a field name is invalid, a compilation error occurs.

Examples:

  # Field names
  FOO
  FOO_BAR3
  foo
  foo_bar
  _foo
  _foo_bar_
  
  # Invalid field names
  foo__bar
  3foo
  Foo::Bar

Variable Name

A variable name begins with $ and is followed by a symbol name.

The symbol name in a variable name can be surrounded by { and }.

Compilation Errors:

If a variable name is invalid, a compilation error occurs.

If an opening { exists and the closing } does not exist, a compilation error occurs.

Examples:

  # Variable names
  $name
  $my_name
  ${name}
  $Foo::name
  $Foo::Bar::name
  ${Foo::name}
  
  # Invalid variable names
  $::name
  $name::
  $Foo::::name
  $my__name
  ${name
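
A variable name check can be layered on top of the symbol name rules. The Python sketch below is a hedged illustration rather than the compiler's logic. `is_variable_name` and `_is_symbol_name` are hypothetical helpers, and the symbol name check repeats the rules from the Symbol Name section.

```python
import re

_SYMBOL_SHAPE = re.compile(r"[A-Za-z0-9_]+(?:::[A-Za-z0-9_]+)*")


def _is_symbol_name(name: str) -> bool:
    """Symbol name rules: shape, no "__", and no leading 0-9."""
    if _SYMBOL_SHAPE.fullmatch(name) is None:
        return False
    return "__" not in name and name[0] not in "0123456789"


def is_variable_name(name: str) -> bool:
    """Return True for $SYMBOL_NAME or ${SYMBOL_NAME}."""
    if not name.startswith("$"):
        return False
    body = name[1:]
    if body.startswith("{"):
        if not body.endswith("}"):
            return False  # an opening { without the closing }
        body = body[1:-1]
    return _is_symbol_name(body)
```

A local variable name check would additionally reject any name that contains ::.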

Class Variable Name

A class variable name is a variable name.

Examples:

  # Class variable names
  $NAME
  $MY_NAME
  ${NAME}
  $FOO::NAME
  $FOO::BAR::NAME
  ${FOO::NAME_BRACE}
  $FOO::name
  
  # Invalid class variable names
  $::NAME
  $NAME::
  $FOO::::NAME
  $MY__NAME
  $3FOO
  ${NAME

Local Variable Name

A local variable name is a variable name without ::.

Examples:

  # Local variable names
  $name
  $my_name
  ${name_brace}
  $_name
  $NAME

  # Invalid local variable names
  $::name
  $name::
  $Foo::name
  $Foo::::name
  $my__name
  ${name
  $3foo

Keywords

The List of Keywords:

  alias
  allow
  as
  basic_type_id
  break
  byte
  can
  case
  cmp
  class
  compile_type_name
  copy
  default
  die
  div_uint
  div_ulong
  double
  dump
  elsif
  else
  enum
  eq
  eval
  eval_error_id
  extends
  for
  float
  false
  gt
  ge
  has
  if
  interface
  int
  interface_t
  isa
  isa_error
  isweak
  is_compile_type
  is_type
  is_error
  is_read_only
  args_width
  last
  length
  lt
  le
  long
  make_read_only
  my
  mulnum_t
  method
  mod_uint
  mod_ulong
  mutable
  native
  ne
  next
  new
  new_string_len
  of
  our
  object
  print
  private
  protected
  public
  precompile
  pointer
  return
  require
  required
  rw
  ro
  say
  static
  switch
  string
  short
  scalar
  true
  type_name
  undef
  unless
  unweaken
  use
  version
  void
  warn
  while
  weaken
  wo
  INIT
  __END__
  __PACKAGE__
  __FILE__
  __LINE__

Operator Tokens

The List of Operator Tokens:

  !
  !=
  $
  %
  &
  &&
  &=
  =
  ==
  ^
  ^=
  |
  ||
  |=
  -
  --
  -=
  ~
  @
  +
  ++
  +=
  *
  *=
  <
  <=
  >
  >=
  <=>
  %=
  <<
  <<=
  >>=
  >>
  >>>
  >>>=
  .
  .=
  /
  /=
  \
  (
  )
  {
  }
  [
  ]
  ;
  :
  ,
  ->
  =>

Comment

Comments have no meaning.

  #COMMENT

A comment begins with #.

It is followed by any string COMMENT.

It ends with ASCII LF.

Line directives take precedence over comments.

File directives take precedence over comments.

Examples:

  # This is a comment line

Line Directive

A line directive sets the current line number.

  #line NUMBER

A line directive begins with #line at the beginning of a line.

It is followed by one or more ASCII SP.

It is followed by NUMBER. NUMBER is a positive 32-bit integer.

It ends with ASCII LF.

The current line number of the source code is set to NUMBER.

Line directives take precedence over comments.

Compilation Errors:

A line directive must begin at the beginning of the line. Otherwise a compilation error occurs.

A line directive must end with "\n". Otherwise a compilation error occurs.

A line directive must have a line number. Otherwise a compilation error occurs.

The line number given to a line directive must be a positive 32-bit integer. Otherwise a compilation error occurs.

Examples:

  class MyClass {
    
    static method main : void () {
      
  #line 39
      
    }
  }
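
The format and error conditions of a line directive can be sketched as a parser for a single line. This Python code is an illustration under assumptions, not the SPVM implementation. `parse_line_directive` is a hypothetical name, "positive 32-bit integer" is read as 1 to 2147483647, and nothing is allowed between NUMBER and the line terminator.

```python
INT32_MAX = 2**31 - 1


def parse_line_directive(line: str) -> int:
    """Return NUMBER from a "#line NUMBER" line, or raise ValueError."""
    if not line.startswith("#line"):
        raise ValueError("a line directive must begin at the beginning of the line")
    if not line.endswith("\n"):
        raise ValueError('a line directive must end with "\\n"')

    body = line[len("#line"):-1]
    number_text = body.lstrip(" ")
    if number_text == "":
        raise ValueError("a line directive must have a line number")
    if number_text == body:
        raise ValueError("#line must be followed by one or more ASCII SP")
    if not (number_text.isascii() and number_text.isdigit()):
        raise ValueError("the line number must be a positive 32-bit integer")

    number = int(number_text)
    if not 1 <= number <= INT32_MAX:
        raise ValueError("the line number must be a positive 32-bit integer")
    return number
```

For the example above, `parse_line_directive("#line 39\n")` returns 39.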

File Directive

A file directive sets the current file path.

  #file "FILE_PATH"

A file directive begins with #file at the beginning of the source code.

It is followed by one or more ASCII SP.

It is followed by ".

It is followed by FILE_PATH. FILE_PATH is a string that represents a file path.

It is closed with ".

It ends with ASCII LF.

The current file path is set to FILE_PATH.

File directives take precedence over comments.

Compilation Errors:

A file directive must begin at the beginning of the source code. Otherwise a compilation error occurs.

A file directive must end with "\n". Otherwise a compilation error occurs.

A file directive must have a file path. Otherwise a compilation error occurs.

The file path of a file directive must be closed with ". Otherwise a compilation error occurs.

Examples:

  #file "/path/MyClass.spvm"
  class MyClass {
  
  }
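
The file directive format can be sketched in the same way for the first line of the source code. This Python code is illustrative only. `parse_file_directive` is a hypothetical name, and the sketch assumes that an empty FILE_PATH counts as a missing file path.

```python
def parse_file_directive(first_line: str) -> str:
    """Return FILE_PATH from a '#file "FILE_PATH"' line, or raise ValueError."""
    if not first_line.startswith("#file"):
        raise ValueError("a file directive must begin at the beginning of the source code")
    if not first_line.endswith("\n"):
        raise ValueError('a file directive must end with "\\n"')

    body = first_line[len("#file"):-1]
    if not body.startswith(" "):
        raise ValueError("#file must be followed by one or more ASCII SP")

    quoted = body.lstrip(" ")
    if len(quoted) < 2 or not quoted.startswith('"'):
        raise ValueError("a file directive must have a file path")
    if not quoted.endswith('"'):
        raise ValueError('the file path must be closed with "')

    file_path = quoted[1:-1]
    if file_path == "":
        raise ValueError("a file directive must have a file path")
    return file_path
```

For the example above, the sketch returns /path/MyClass.spvm.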

__END__

If a line begins with __END__ and ends with ASCII LF, that line and all lines below it are interpreted as comments.

Examples:

  class MyClass {
    
  }
  
  __END__
  
  foo
  bar

POD

POD is a syntax for writing multiline comments. POD has no meaning.

The Beginning of a POD:

  =NAME

The beginning of a POD begins with = at the beginning of a line.

It is followed by NAME. NAME is any string that begins with ASCII a-zA-Z.

It ends with ASCII LF.

The End of a POD:

  =cut

The end of a POD begins with = at the beginning of a line.

It is followed by cut.

It ends with ASCII LF.

Examples:

  =pod
  
  Comment1
  Comment2
  
  =cut
  
  =head1
  
  Comment1
  Comment2
  
  =cut
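
Because POD has no meaning, a tokenizer can skip everything from the beginning of a POD through its =cut line. The Python sketch below shows one way to do that. `strip_pod` is a hypothetical name, and the sketch simply drops the skipped lines, which is an assumption (a real tokenizer would still count them for line numbers).

```python
import re

# The beginning of a POD: "=" at the start of a line followed by an ASCII letter.
_POD_BEGIN = re.compile(r"=[A-Za-z]")
# Split on ASCII LF only, keeping each line terminator.
_LINES = re.compile(r"[^\n]*\n|[^\n]+")


def strip_pod(source: str) -> str:
    """Return source with every POD block removed."""
    kept_lines = []
    in_pod = False
    for line in _LINES.findall(source):
        if in_pod:
            if line == "=cut\n":
                in_pod = False  # the =cut line itself is part of the POD
            continue
        if _POD_BEGIN.match(line):
            in_pod = True
            continue
        kept_lines.append(line)
    return "".join(kept_lines)
```

The lines are split with a regular expression rather than `str.splitlines`, because `splitlines` would also break on FF, which is an ordinary space character in SPVM.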

Fat Comma

A fat comma is

  =>

The fat comma is an alias for a comma ,.

  # Comma
  ["a", "b", "c", "d"]
  
  # Fat Comma
  ["a" => "b", "c" => "d"]

If the left operand of a fat comma is a symbol name without ::, it is wrapped in " and treated as a string literal.

  # foo_bar2 is treated as "foo_bar2"
  [foo_bar2 => "Mark"]
  
  ["foo_bar2" => "Mark"]

Literals

A literal represents a constant value.

Numeric Literals

A numeric literal represents a constant number.

Integer Literals

An integer literal represents a constant number of an integer type.

Integer Literal Decimal Notation

The integer literal decimal notation represents a number of the int type or the long type using decimal numbers 0-9.

It can begin with a minus -.

It is followed by one or more of 0-9.

_ can be placed at any position after the first 0-9 as a separator. _ has no meaning.

It can end with the suffix L or l.

If the suffix L or l exists, the return type is the long type. Otherwise the return type is the int type.

Compilation Errors:

If the return type is the int type and the value is greater than the maximum value of the int type or less than the minimum value of the int type, a compilation error occurs.

If the return type is the long type and the value is greater than the maximum value of the long type or less than the minimum value of the long type, a compilation error occurs.

Examples:

  123
  -123
  123L
  123l
  123_456_789
  -123_456_789L
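
The decimal notation rules, including the range checks, can be sketched as follows. This is a Python illustration, not the compiler's code. `parse_decimal_integer_literal` is a hypothetical name, int and long are assumed to be signed 32-bit and 64-bit integers, and the caller is assumed to have already decided that the text is in decimal notation (a leading 0 followed by more digits is the octal notation).

```python
import re

# Optional minus, a first digit, then digits or "_" separators, then an optional L/l.
_DECIMAL = re.compile(r"(-?)([0-9][0-9_]*)([Ll]?)")

INT_RANGE = (-(2**31), 2**31 - 1)
LONG_RANGE = (-(2**63), 2**63 - 1)


def parse_decimal_integer_literal(text: str):
    """Return (type_name, value), or raise ValueError for an invalid literal."""
    match = _DECIMAL.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid decimal integer literal: {text!r}")
    sign, digits, suffix = match.groups()

    value = int(digits.replace("_", ""))  # "_" has no meaning
    if sign:
        value = -value

    type_name, (low, high) = ("long", LONG_RANGE) if suffix else ("int", INT_RANGE)
    if not low <= value <= high:
        raise ValueError(f"{text!r} is out of range for the {type_name} type")
    return type_name, value
```

For example, 123 yields the int type, 123L yields the long type, and 2147483648 without a suffix is rejected because it exceeds the maximum value of the int type.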

Integer Literal Hexadecimal Notation

The integer literal hexadecimal notation represents a number of the int type or the long type using hexadecimal numbers 0-9a-fA-F.

It can begin with a minus -.

It is followed by 0x or 0X.

It is followed by one or more 0-9a-fA-F. This is called the hexadecimal numbers part.

_ can be placed at any position after 0x or 0X as a separator. _ has no meaning.

It can end with the suffix L or l.

If the suffix L or l exists, the return type is the long type. Otherwise the return type is the int type.

If the return type is the int type, the hexadecimal numbers part is interpreted as an unsigned 32 bit integer, and is converted to a signed 32-bit integer without changing the bits. For example, 0xFFFFFFFF is -1.

If the return type is the long type, the hexadecimal numbers part is interpreted as unsigned 64 bit integer, and is converted to a signed 64-bit integer without changing the bits. For example, 0xFFFFFFFFFFFFFFFFL is -1L.

Compilation Errors:

If the return type is the int type and the hexadecimal numbers part is greater than hexadecimal FFFFFFFF, a compilation error occurs.

If the return type is the long type and the hexadecimal numbers part is greater than hexadecimal FFFFFFFFFFFFFFFF, a compilation error occurs.

Examples:

  0x3b4f
  0X3b4f
  -0x3F1A
  0xDeL
  0xFFFFFFFF
  0xFF_FF_FF_FF
  0xFFFFFFFFFFFFFFFFL
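
The conversion from the unsigned hexadecimal numbers part to a signed value is a two's complement reinterpretation. The Python sketch below illustrates it for a literal without the optional leading minus, which is left out to avoid guessing how negation interacts with the conversion. `parse_hex_integer_literal` is a hypothetical name.

```python
import re

# 0x or 0X, then hexadecimal digits or "_" separators, then an optional L/l.
_HEX = re.compile(r"0[xX]([0-9a-fA-F_]+)([Ll]?)")


def parse_hex_integer_literal(text: str):
    """Return (type_name, signed_value) for a hexadecimal literal without a minus."""
    match = _HEX.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid hexadecimal integer literal: {text!r}")
    digits = match.group(1).replace("_", "")  # "_" has no meaning
    if digits == "":
        raise ValueError("the hexadecimal numbers part must not be empty")

    type_name, bits = ("long", 64) if match.group(2) else ("int", 32)
    unsigned = int(digits, 16)
    if unsigned >= 1 << bits:
        raise ValueError(f"{text!r} does not fit in {bits} bits")

    # Reinterpret the same bits as a signed integer (two's complement).
    if unsigned >= 1 << (bits - 1):
        return type_name, unsigned - (1 << bits)
    return type_name, unsigned
```

With this sketch, 0xFFFFFFFF yields -1 of the int type, matching the example in the text. The same reinterpretation applies to the octal and binary notations with base 8 and base 2.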

Integer Literal Octal Notation

The integer literal octal notation represents a number of the int type or the long type using octal numbers 0-7.

It can begin with a minus -.

It is followed by 0.

It is followed by one or more 0-7. This is called the octal numbers part.

_ can be placed at any position after 0 as a separator. _ has no meaning.

It can end with the suffix L or l.

If the suffix L or l exists, the return type is the long type. Otherwise the return type is the int type.

If the return type is the int type, the octal numbers part is interpreted as an unsigned 32 bit integer, and is converted to a signed 32-bit integer without changing the bits. For example, 037777777777 is -1.

If the return type is the long type, the octal numbers part is interpreted as unsigned 64 bit integer, and is converted to a signed 64-bit integer without changing the bits. For example, 01777777777777777777777L is -1L.

Compilation Errors:

If the return type is the int type and the octal numbers part is greater than octal 37777777777, a compilation error occurs.

If the return type is the long type and the octal numbers part is greater than octal 1777777777777777777777, a compilation error occurs.

Examples:

  0755
  -0644
  0666L
  0655_755

Integer Literal Binary Notation

The integer literal binary notation represents a number of the int type or the long type using binary numbers 0 and 1.

It can begin with a minus -.

It is followed by 0b or 0B.

It is followed by one or more 0 and 1. This is called the binary numbers part.

_ can be placed at any position after 0b or 0B as a separator. _ has no meaning.

It can end with the suffix L or l.

If the suffix L or l exists, the return type is the long type. Otherwise the return type is the int type.

If the return type is the int type, the binary numbers part is interpreted as an unsigned 32 bit integer, and is converted to a signed 32-bit integer without changing the bits. For example, 0b11111111111111111111111111111111 is -1.

If the return type is the long type, the binary numbers part is interpreted as unsigned 64 bit integer, and is converted to a signed 64-bit integer without changing the bits. For example, 0b1111111111111111111111111111111111111111111111111111111111111111L is -1L.

Compilation Errors:

If the return type is the int type and the value that is except for - is greater than binary 11111111111111111111111111111111, a compilation error occurs.

If the return type is the long type and the value that is except for - is greater than binary 1111111111111111111111111111111111111111111111111111111111111111, a compilation error occurs.

Examples:

  0b0101
  -0b1010
  0b110000L
  0b10101010_10101010

Floating Point Literals

A floating point literal represents a floating point number.

Floating Point Literal Decimal Notation

The floating point literal decimal notation represents a number of the float type or the double type using decimal numbers 0-9.

It can begin with a minus -.

It is followed by one or more 0-9.

_ can be placed at any position after the first 0-9.

It can be followed by a floating point part, an exponent part, or a combination of a floating point part and an exponent part.

[Floating Point Part Begin]

A floating point part begins with a period (.).

It is followed by one or more 0-9.

[Floating Point Part End]

[Exponent Part Begin]

An exponent part begins with e or E.

It can be followed by + or -.

It is followed by one or more 0-9.

[Exponent Part End]

A floating point literal decimal notation can end with a suffix f, F, d, or D.

If a suffix does not exist, a floating point literal decimal notation must have a floating point part or an exponent part.

If the suffix f or F exists, the return type is the float type. Otherwise the return type is the double type.

Compilation Errors:

If the return type is the float type, the floating point literal decimal notation without the suffix must be able to be parsed by the strtof function in the C language. Otherwise, a compilation error occurs.

If the return type is the double type, the floating point literal decimal notation without the suffix must be able to be parsed by the strtod function in the C language. Otherwise, a compilation error occurs.

Examples:

  1.32
  -1.32
  1.32f
  1.32F
  1.32d
  1.32D
  1.32e3
  1.32e-3
  1.32E+3
  1.32E-3
  1.32e3f
  12e7
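
The structure of the decimal notation (integer digits, optional floating point part, optional exponent part, optional suffix) maps naturally onto a regular expression. The Python sketch below only determines the return type and does not compute the value. `decimal_float_literal_type` is a hypothetical name, and allowing _ only among the leading digits is an assumption of this sketch.

```python
import re

_DECIMAL_FLOAT = re.compile(
    r"-?[0-9][0-9_]*"                  # optional minus and the leading digits
    r"(?P<fraction>\.[0-9]+)?"         # floating point part
    r"(?P<exponent>[eE][+-]?[0-9]+)?"  # exponent part
    r"(?P<suffix>[fFdD])?"             # type suffix
)


def decimal_float_literal_type(text: str) -> str:
    """Return "float" or "double", or raise ValueError for an invalid literal."""
    match = _DECIMAL_FLOAT.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid floating point literal: {text!r}")

    suffix = match.group("suffix")
    has_part = match.group("fraction") is not None or match.group("exponent") is not None
    if suffix is None and not has_part:
        # Without a suffix, a floating point part or an exponent part is required.
        raise ValueError(f"{text!r} has no suffix, floating point part, or exponent part")

    return "float" if suffix in ("f", "F") else "double"
```

For example, 1.32f is of the float type, while 1.32, 1.32d, and 12e7 are of the double type.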

Floating Point Literal Hexadecimal Notation

The floating point literal hexadecimal notation represents a number of the float type or the double type using hexadecimal numbers 0-9a-fA-F.

It can begin with a minus -.

It is followed by 0x or 0X.

It is followed by one or more 0-9a-fA-F.

_ can be placed at any position after 0x or 0X.

It can be followed by a floating point part, an exponent part, or a combination of a floating point part and an exponent part.

[Floating Point Part Begin]

A floating point part begins with a period (.).

It is followed by one or more 0-9a-fA-F.

[Floating Point Part End]

[Exponent Part Begin]

An exponent part begins with p or P.

It can be followed by + or -.

It is followed by one or more 0-9.

[Exponent Part End]

A floating point literal hexadecimal notation can end with a suffix f, F, d, or D.

If a suffix does not exist, a floating point literal hexadecimal notation must have a floating point part or an exponent part.

Compilation Errors:

If the return type is the float type, the floating point literal hexadecimal notation without the suffix must be able to be parsed by the strtof function in the C language. Otherwise, a compilation error occurs.

If the return type is the double type, the floating point literal hexadecimal notation without the suffix must be able to be parsed by the strtod function in the C language. Otherwise, a compilation error occurs.

Examples:

  0x3d3d.edp0
  0x3d3d.edp3
  0x3d3d.edP3
  0x3d3d.edP+3
  0x3d3d.edP-3f
  0x3d3d.edP-3F
  0x3d3d.edP-3d
  0x3d3d.edP-3D
  0x3d3dP+3

Bool Literals

The bool literal represents a bool object.

true

true is the alias for Bool#TRUE.

  true

Examples:

  # true
  my $bool_object_true = true;

false

false is the alias for Bool#FALSE.

  false

Examples:

  # false
  my $bool_object_false = false;

Character Literal

A character literal represents a number of the byte type that normally represents an ASCII character.

It begins with '.

It is followed by a printable ASCII character 0x20-0x7e or a character literal escape character.

It ends with '.

The return type is the byte type.

Compilation Errors:

If the format of the character literal is invalid, a compilation error occurs.

Character Literal Escape Characters

The List of Character Literal Escape Characters:

  Character Literal Escape Characters   Values
  \a                                    0x07 BEL
  \t                                    0x09 HT
  \n                                    0x0A LF
  \f                                    0x0C FF
  \r                                    0x0D CR
  \"                                    0x22 "
  \'                                    0x27 '
  \\                                    0x5C \
  Octal Escape Character                A number represented by an octal escape character
  Hexadecimal Escape Character          A number represented by a hexadecimal escape character

The type of every character literal escape character is the byte type.

Examples:

  # Character literals
  'a'
  'x'
  '\a'
  '\t'
  '\n'
  '\f'
  '\r'
  '\"'
  '\''
  '\\'
  ' '
  '\0'
  '\012'
  '\377'
  '\o{1}'
  '\xab'
  '\xAB'
  '\x0D'
  '\x0A'
  '\xD'
  '\xA'
  '\xFF'
  '\x{A}'

Octal Escape Character

The octal escape character represents an unsigned 8-bit integer using octal numbers 0-7.

The octal escape character is a part of a string literal and a character literal.

It begins with \0, \1, \2, \3, \4, \5, \6, \7, or \o{.

If it begins with \0, \1, \2, \3, \4, \5, \6, or \7, it can be followed by one or two 0-7.

If it begins with \o{, it is followed by one to three 0-7, and ends with }.

The octal numbers after \ or \o{ are called the octal numbers part.

The octal numbers part is interpreted as an unsigned 8-bit integer, and is converted to a number of the byte type without changing the bits.

Compilation Errors:

The octal numbers part must be less than or equal to 377. Otherwise a compilation error occurs.

If an octal escape character begins with \o{, the closing } must exist. Otherwise a compilation error occurs.

Examples:

  # Octal escape characters
  \0
  \01
  \03
  \012
  \001
  \077
  \377
  \o{1}
  \o{12}
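
The decoding of an octal escape character can be sketched as below. This Python code is illustrative and rests on two assumptions: the byte type is a signed 8-bit integer, and a single digit after the backslash (as in \0) is accepted, as the examples above show. `decode_octal_escape` is a hypothetical name.

```python
import re

# Either a backslash with one to three octal digits, or \o{ ... } with one to three.
_OCTAL_ESCAPE = re.compile(r"\\(?:([0-7]{1,3})|o\{([0-7]{1,3})\})")


def decode_octal_escape(escape: str) -> int:
    """Return the byte value (-128 to 127) of an octal escape character."""
    match = _OCTAL_ESCAPE.fullmatch(escape)
    if match is None:
        raise ValueError(f"invalid octal escape character: {escape!r}")
    octal_numbers_part = match.group(1) or match.group(2)

    unsigned = int(octal_numbers_part, 8)
    if unsigned > 0o377:
        raise ValueError("the octal numbers part must be less than or equal to 377")

    # Same bits, read as a signed 8-bit integer.
    return unsigned - 256 if unsigned >= 128 else unsigned
```

Under these assumptions, \012 decodes to 10 (LF) and \377 decodes to -1. A hexadecimal escape character can be decoded the same way with base 16.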

Hexadecimal Escape Character

The hexadecimal escape character represents an unsigned 8-bit integer using hexadecimal numbers 0-9a-fA-F.

The hexadecimal escape character is a part of a string literal and a character literal.

The hexadecimal escape character begins with \x.

It can be followed by {.

It is followed by one or two 0-9a-fA-F. This is called the hexadecimal numbers part.

If it contains {, the hexadecimal numbers part must be followed by }.

The hexadecimal numbers part is interpreted as an unsigned 8-bit integer, and is converted to a number of the byte type without changing the bits.

Compilation Errors:

If the format of the hexadecimal escape character is invalid, a compilation error occurs.

Examples:

  # Hexadecimal escape characters
  \xab
  \xAB
  \x0D
  \x0A
  \xD
  \xA
  \xFF
  \x{A}

String Literal

A string literal represents a constant string.

A string literal begins with ".

It is followed by zero or more UTF-8 characters, string literal escape characters, or variable expansions.

It ends with ".

The return type is the string type.

Compilation Errors:

If the format of the string literal is invalid, a compilation error occurs.

Examples:

  # String literals
  ""
  "abc";
  "あいう"
  "hello\tworld\n"
  "hello\x0D\x0A"
  "hello\xA"
  "hello\x{0A}"
  "hello\0"
  "hello\012"
  "hello\377"
  "AAA $foo BBB"
  "AAA $FOO BBB"
  "AAA $$foo BBB"
  "AAA $foo->{x} BBB"
  "AAA $foo->[3] BBB"
  "AAA $foo->{x}[3] BBB"
  "AAA $@ BBB"
  "\N{U+3042}\N{U+3044}\N{U+3046}"

String Literal Escape Characters

The List of String Literal Escape Characters:

  String Literal Escape Characters   Values
  \a                                 0x07 BEL
  \t                                 0x09 HT
  \n                                 0x0A LF
  \f                                 0x0C FF
  \r                                 0x0D CR
  \"                                 0x22 "
  \$                                 0x24 $
  \'                                 0x27 '
  \\                                 0x5C \
  Octal Escape Character             A number represented by an octal escape character
  Hexadecimal Escape Character       A number represented by a hexadecimal escape character
  Unicode Escape Character           Numbers represented by a Unicode escape character
  Raw Escape Character               Numbers represented by a raw escape character

The type of every string literal escape character other than the Unicode escape character and the raw escape character is the byte type.

The type of each number contained in the Unicode escape character and the raw escape character is the byte type.

Unicode Escape Character

The Unicode escape character represents a UTF-8 character.

A UTF-8 character is represented by a Unicode code point written with hexadecimal numbers 0-9a-fA-F.

The UTF-8 character consists of one to four numbers of the byte type.

The Unicode escape character is a part of a string literal.

It begins with \N{U+.

It is followed by one or more 0-9a-fA-F. This is called the code point part.

It ends with }.

Compilation Errors:

If a code point part is not a Unicode scalar value, a compilation error occurs.

Examples:

  # Unicode escape characters
  
  # あ
  \N{U+3042}
  
  # い
  \N{U+3044}
  
  # う
  \N{U+3046}
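
How a code point part becomes one to four numbers of the byte type can be sketched as below. This Python code is an illustration, not the compiler's implementation. `decode_unicode_escape` is a hypothetical name, and the byte type is assumed to be a signed 8-bit integer.

```python
import re

_UNICODE_ESCAPE = re.compile(r"\\N\{U\+([0-9a-fA-F]+)\}")


def decode_unicode_escape(escape: str) -> list:
    """Return the UTF-8 bytes of the escape as signed 8-bit values."""
    match = _UNICODE_ESCAPE.fullmatch(escape)
    if match is None:
        raise ValueError(f"invalid Unicode escape character: {escape!r}")

    code_point = int(match.group(1), 16)
    # A Unicode scalar value is any code point up to U+10FFFF except surrogates.
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"U+{code_point:X} is not a Unicode scalar value")

    utf8_bytes = chr(code_point).encode("utf-8")
    return [byte - 256 if byte >= 128 else byte for byte in utf8_bytes]
```

For \N{U+3042}, the UTF-8 encoding is the three bytes 0xE3 0x81 0x82, which this sketch returns as -29, -127, and -126.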

Raw Escape Characters

A raw escape character is an escape character in which \ is interpreted as ASCII \ and the following character is interpreted as itself.

For example, the raw escape character \s is the two ASCII characters \ and s.

A raw escape character is a part of a string literal.

The List of Raw Escape Characters:

Raw Escape Characters
\!
\#
\%
\&
\(
\)
\*
\+
\,
\-
\.
\/
\:
\;
\<
\=
\>
\?
\@
\A
\B
\D
\G
\H
\K
\N
\P
\R
\S
\V
\W
\X
\Z
\[
\]
\^
\_
\`
\b
\d
\g
\h
\k
\p
\s
\v
\w
\z
\{
\|
\}
\~

Variable Expansion

Variable expansion is a syntax for embedding the following operations into a string literal: getting a local variable, getting a class variable, a dereference, getting a field, getting an array element, and getting the exception variable.

  "AAA $foo BBB"
  "AAA $FOO BBB"
  "AAA $$foo BBB"
  "AAA $foo->{x} BBB"
  "AAA $foo->[3] BBB"
  "AAA $foo->{x}[3] BBB"
  "AAA $foo->{x}->[3] BBB"
  "AAA $@ BBB"
  "AAA ${foo}BBB"

The above code is expanded to the following code.

  "AAA " . $foo . " BBB"
  "AAA " . $FOO . " BBB"
  "AAA " . $$foo . " BBB"
  "AAA " . $foo->{x} . " BBB"
  "AAA " . $foo->[3] . " BBB"
  "AAA " . $foo->{x}[3] . " BBB"
  "AAA " . $foo->{x}->[3] . " BBB"
  "AAA " . $@ . " BBB"
  "AAA " . ${foo} . "BBB"

Getting a field in a variable expansion must not contain space characters between { and }.

The index used for getting an array element must be a constant integer.

Getting an array element in a variable expansion must not contain space characters between [ and ].

A $ at the end of a string literal is interpreted as the character $, not as a variable expansion.

  # AAA$
  "AAA$"

Single-Quoted String Literal

A single-quoted string literal represents a constant string that has no variable expansions and supports only a few escape characters.

It begins with q'.

It is followed by zero or more UTF-8 characters, or single-quoted string literal escape characters.

It ends with '.

The return type is the string type.

Compilation Errors:

A single-quoted string literal must end with '. Otherwise a compilation error occurs.

If the escape character in a single-quoted string literal is invalid, a compilation error occurs.

Examples:

  # Single-quoted string literals
  q'abc';
  q'abc\'\\';

Single-Quoted String Literal Escape Characters

The List of Single-Quoted String Literal Escape Characters:

  Single-Quoted String Literal Escape Characters   Values
  \'                                               0x27 '
  \\                                               0x5C \

The type of every single-quoted string literal escape character is the byte type.

Here Document

A here document represents a constant string in multiple lines without escape characters and variable expansions.

  <<'HERE_DOCUMENT_NAME';
  LINE1
  LINE2
  LINEn
  HERE_DOCUMENT_NAME

A here document begins with <<'HERE_DOCUMENT_NAME'; and ASCII LF.

HERE_DOCUMENT_NAME is a here document name.

It is followed by a string in multiple lines.

It ends with HERE_DOCUMENT_NAME from the beginning of a line and ASCII LF.

Compilation Errors:

<<'HERE_DOCUMENT_NAME'; must not contain space characters. Otherwise a compilation error occurs.

Examples:

  # Here document
  my $string = <<'EOS';
  Hello
  World
  EOS

Here Document Name

A here document name consists of a-z, A-Z, _, 0-9.

The length of a here document name is greater than or equal to 0.

A here document name cannot begin with 0-9.

A here document name cannot contain __.

Compilation Errors:

If the format of a here document name is invalid, a compilation error occurs.

See Also

Copyright & License

Copyright (c) 2023 Yuki Kimoto

MIT License