亚洲免费在线-亚洲免费在线播放-亚洲免费在线观看-亚洲免费在线观看视频-亚洲免费在线看-亚洲免费在线视频

Lexer and parser generators (ocamllex, ocaml

系統(tǒng) 1645 0

Chapter 12 Lexer and parser generators (ocamllex, ocamlyacc)


This chapter describes two program generators: ocamllex , that produces a lexical analyzer from a set of regular expressions with associated semantic actions, and ocamlyacc , that produces a parser from a grammar with associated semantic actions.

These program generators are very close to the well-known lex and yacc commands that can be found in most C programming environments. This chapter assumes a working knowledge of lex and yacc : while it describes the input syntax for ocamllex and ocamlyacc and the main differences with lex and yacc , it does not explain the basics of writing a lexer or parser description in lex and yacc . Readers unfamiliar with lex and yacc are referred to ``Compilers: principles, techniques, and tools'' by Aho, Sethi and Ullman (Addison-Wesley, 1986), or ``Lex & Yacc'', by Levine, Mason and Brown (O'Reilly, 1992).

12.1 Overview of ocamllex


The ocamllex command produces a lexical analyzer from a set of regular expressions with attached semantic actions, in the style of lex . Assuming the input file is lexer .mll , executing
            ocamllex 
    
      lexer
    
    .mll
  
produces Caml code for a lexical analyzer in file lexer .ml . This file defines one lexing function per entry point in the lexer definition. These functions have the same names as the entry points. Lexing functions take as argument a lexer buffer, and return the semantic attribute of the corresponding entry point.

Lexer buffers are an abstract data type implemented in the standard library module Lexing . The functions Lexing.from_channel , Lexing.from_string and Lexing.from_function create lexer buffers that read from an input channel, a character string, or any reading function, respectively. (See the description of module Lexing in chapter 20 .)

When used in conjunction with a parser generated by ocamlyacc , the semantic actions compute a value belonging to the type token defined by the generated parsing module. (See the description of ocamlyacc below.)

12.2 Syntax of lexer definitions


The format of lexer definitions is as follows:
    { 
    
      header
    
     }

let 
    
      ident
    
     = 
    
      regexp
    
     ...

rule 
    
      entrypoint
    
     =

  parse 
    
      regexp
    
     { 
    
      action
    
     }

      | ...

      | 
    
      regexp
    
     { 
    
      action
    
     }

and 
    
      entrypoint
    
     =

  parse ...

and ...

{ 
    
      trailer
    
     }
  
Comments are delimited by (* and *) , as in Caml.

12.2.1 Header and trailer

The header and trailer sections are arbitrary Caml text enclosed in curly braces. Either or both can be omitted. If present, the header text is copied as is at the beginning of the output file and the trailer text at the end. Typically, the header section contains the open directives required by the actions, and possibly some auxiliary functions used in the actions.

12.2.2 Naming regular expressions


Between the header and the entry points, one can give names to frequently-occurring regular expressions. This is written let ident = ? regexp . In following regular expressions, the identifier ident can be used as shorthand for regexp .

12.2.3 Entry points


The names of the entry points must be valid identifiers for Caml values (starting with a lowercase letter). Each entry point becomes a Caml function that takes one argument of type Lexing.lexbuf . Characters are read from the Lexing.lexbuf argument and matched against the regular expressions provided in the rule, until a prefix of the input matches one of the rule. The corresponding action is then evaluated and returned as the result of the function.

If several regular expressions match a prefix of the input, the ``longest match'' rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.

12.2.4 Regular expressions


The regular expressions are in the style of lex , with a more Caml-like syntax.
' char '
A character constant, with the same syntax as Objective Caml character constants. Match the denoted character.

_
(Underscore.) Match any character.

eof
Match the end of the lexer input.
Note: On some systems, with interactive input, an end-of-file may be followed by more characters. However, ocamllex will not correctly handle regular expressions that contain eof followed by something else.

" string "
A string constant, with the same syntax as Objective Caml string constants. Match the corresponding sequence of characters.

[ character-set ]
Match any single character belonging to the given character set. Valid character sets are: single character constants ' c ' ; ranges of characters ' c 1 ' - ' ? c 2 ' (all characters between c 1 and c 2 , inclusive); and the union of two or more character sets, denoted by concatenation.

[ ^ character-set ]
Match any single character not belonging to the given character set.

regexp *
(Repetition.) Match the concatenation of zero or more strings that match regexp .

regexp +
(Strict repetition.) Match the concatenation of one or more strings that match regexp .

regexp ?
(Option.) Match either the empty string, or a string matching regexp .

regexp 1 | regexp 2
(Alternative.) Match any string that matches either regexp 1 or regexp 2

regexp 1 ? regexp 2
(Concatenation.) Match the concatenation of two strings, the first matching regexp 1 , the second matching regexp 2 .

( regexp )
Match the same strings as regexp .

ident
Reference the regular expression bound to ident by an earlier let ident = regexp definition.
Concerning the precedences of operators, * and + have highest precedence, followed by ? , then concatenation, then | (alternation).

12.2.5 Actions


The actions are arbitrary Caml expressions. They are evaluated in a context where the identifier lexbuf is bound to the current lexer buffer. Some typical uses for lexbuf , in conjunction with the operations on lexer buffers provided by the Lexing standard library module, are listed below.
Lexing.lexeme lexbuf
Return the matched string.

Lexing.lexeme_char lexbuf n
Return the n th character in the matched string. The first character corresponds to n = 0.

Lexing.lexeme_start lexbuf
Return the absolute position in the input text of the beginning of the matched string. The first character read from the input text has position 0.

Lexing.lexeme_end lexbuf
Return the absolute position in the input text of the end of the matched string. The first character read from the input text has position 0.

entrypoint lexbuf
(Where entrypoint is the name of another entry point in the same lexer definition.) Recursively call the lexer on the given entry point. Useful for lexing nested comments, for example.

12.2.6 Reserved identifiers


All identifiers starting with __ocaml_lex are reserved for use by ocamllex ; do not use any such identifier in your programs.

12.3 Overview of ocamlyacc


The ocamlyacc command produces a parser from a context-free grammar specification with attached semantic actions, in the style of yacc . Assuming the input file is grammar .mly , executing
            ocamlyacc 
    
      options
    
    
      grammar
    
    .mly
  
produces Caml code for a parser in the file grammar .ml , and its interface in file grammar .mli .

The generated module defines one parsing function per entry point in the grammar. These functions have the same names as the entry points. Parsing functions take as arguments a lexical analyzer (a function from lexer buffers to tokens) and a lexer buffer, and return the semantic attribute of the corresponding entry point. Lexical analyzer functions are usually generated from a lexer specification by the ocamllex program. Lexer buffers are an abstract data type implemented in the standard library module Lexing . Tokens are values from the concrete type token , defined in the interface file grammar .mli produced by ocamlyacc .

12.4 Syntax of grammar definitions


Grammar definitions have the following format:
    %{

  
    
      header
    
    

%}

  
    
      declarations
    
    

%%

  
    
      rules
    
    

%%

  
    
      trailer
    
  
Comments are enclosed between /* and */ (as in C) in the ``declarations'' and ``rules'' sections, and between (* and *) (as in Caml) in the ``header'' and ``trailer'' sections.

12.4.1 Header and trailer


The header and the trailer sections are Caml code that is copied as is into file grammar .ml . Both sections are optional. The header goes at the beginning of the output file; it usually contains open directives and auxiliary functions required by the semantic actions of the rules. The trailer goes at the end of the output file.

12.4.2 Declarations


Declarations are given one per line. They all start with a % sign.
%token symbol ... symbol
Declare the given symbols as tokens (terminal symbols). These symbols are added as constant constructors for the token concrete type.
?
%token < type > symbol ... symbol
Declare the given symbols as tokens with an attached attribute of the given type. These symbols are added as constructors with arguments of the given type for the token concrete type. The type part is an arbitrary Caml type expression, except that all type constructor names must be fully qualified (e.g. Modname.typename ) for all types except standard built-in types, even if the proper open directives (e.g. open Modname ) were given in the header section. That's because the header is copied only to the .ml output file, but not to the .mli output file, while the type part of a %token declaration is copied to both.
?
%start symbol ... symbol
Declare the given symbols as entry points for the grammar. For each entry point, a parsing function with the same name is defined in the output module. Non-terminals that are not declared as entry points have no such parsing function. Start symbols must be given a type with the %type directive below.
?
%type < type > symbol ... symbol
Specify the type of the semantic attributes for the given symbols. This is mandatory for start symbols only. Other nonterminal symbols need not be given types by hand: these types will be inferred when running the output files through the Objective Caml compiler (unless the -s option is in effect). The type part is an arbitrary Caml type expression, except that all type constructor names must be fully qualified, as explained above for %token .
?
%left symbol ... symbol
%right symbol ... symbol
%nonassoc symbol ... symbol
Associate precedences and associativities to the given symbols. All symbols on the same line are given the same precedence. They have higher precedence than symbols declared before in a %left , %right or %nonassoc line. They have lower precedence than symbols declared after in a %left , %right or %nonassoc line. The symbols are declared to associate to the left ( %left ), to the right ( %right ), or to be non-associative ( %nonassoc ). The symbols are usually tokens. They can also be dummy nonterminals, for use with the %prec directive inside the rules.

12.4.3 Rules


The syntax for rules is as usual:
    
      nonterminal
    
     :

    
    
      symbol
    
     ... 
    
      symbol
    
     { 
    
      semantic-action
    
     }

  | ...

  | 
    
      symbol
    
     ... 
    
      symbol
    
     { 
    
      semantic-action
    
     }

;
  
Rules can also contain the %prec symbol directive in the right-hand side part, to override the default precedence and associativity of the rule with the precedence and associativity of the given symbol.

Semantic actions are arbitrary Caml expressions, that are evaluated to produce the semantic attribute attached to the defined nonterminal. The semantic actions can access the semantic attributes of the symbols in the right-hand side of the rule with the $ notation: $1 is the attribute for the first (leftmost) symbol, $2 is the attribute for the second symbol, etc.

The rules may contain the special symbol error to indicate resynchronization points, as in yacc .

Actions occurring in the middle of rules are not supported.

Nonterminal symbols are like regular Caml symbols, except that they cannot end with ' (single quote).

12.4.4 Error handling


Error recovery is supported as follows: when the parser reaches an error state (no grammar rules can apply), it calls a function named parse_error with the string "syntax error" as argument. The default parse_error function does nothing and returns, thus initiating error recovery (see below). The user can define a customized parse_error function in the header section of the grammar file.

The parser also enters error recovery mode if one of the grammar actions raises the Parsing.Parse_error exception.

In error recovery mode, the parser discards states from the stack until it reaches a place where the error token can be shifted. It then discards tokens from the input until it finds three successive tokens that can be accepted, and starts processing with the first of these. If no state can be uncovered where the error token can be shifted, then the parser aborts by raising the Parsing.Parse_error exception.

Refer to documentation on yacc for more details and guidance in how to use error recovery.

12.5 Options


The ocamlyacc command recognizes the following options:
-v
Generate a description of the parsing tables and a report on conflicts resulting from ambiguities in the grammar. The description is put in file grammar .output .

-b prefix
Name the output files prefix .ml , prefix .mli , prefix .output , instead of the default naming convention.
At run-time, the ocamlyacc -generated parser can be debugged by setting the p option in the OCAMLRUNPARAM environment variable (see section 10.2 ). This causes the pushdown automaton executing the parser to print a trace of its action (tokens shifted, rules reduced, etc). The trace mentions rule numbers and state numbers that can be interpreted by looking at the file grammar .output generated by ocamlyacc -v .

12.6 A complete example


The all-time favorite: a desk calculator. This program reads arithmetic expressions on standard input, one per line, and prints their values. Here is the grammar definition:
            /* File parser.mly */

        %token <int> INT

        %token PLUS MINUS TIMES DIV

        %token LPAREN RPAREN

        %token EOL

        %left PLUS MINUS        /* lowest precedence */

        %left TIMES DIV         /* medium precedence */

        %nonassoc UMINUS        /* highest precedence */

        %start main             /* the entry point */

        %type <int> main

        %%

        main:

            expr EOL                { $1 }

        ;

        expr:

            INT                     { $1 }

          | LPAREN expr RPAREN      { $2 }

          | expr PLUS expr          { $1 + $3 }

          | expr MINUS expr         { $1 - $3 }

          | expr TIMES expr         { $1 * $3 }

          | expr DIV expr           { $1 / $3 }

          | MINUS expr %prec UMINUS { - $2 }

        ;
  
Here is the definition for the corresponding lexer:
            (* File lexer.mll *)

        {

        open Parser        (* The type token is defined in parser.mli *)

        exception Eof

        }

        rule token = parse

            [' ' '\t']     { token lexbuf }     (* skip blanks *)

          | ['\n' ]        { EOL }

          | ['0'-'9']+     { INT(int_of_string(Lexing.lexeme lexbuf)) }

          | '+'            { PLUS }

          | '-'            { MINUS }

          | '*'            { TIMES }

          | '/'            { DIV }

          | '('            { LPAREN }

          | ')'            { RPAREN }

          | eof            { raise Eof }
  
Here is the main program, that combines the parser with the lexer:
            (* File calc.ml *)

        let _ =

          try

            let lexbuf = Lexing.from_channel stdin in

            while true do

              let result = Parser.main Lexer.token lexbuf in

                print_int result; print_newline(); flush stdout

            done

          with Lexer.Eof ->

            exit 0
  
To compile everything, execute:
            ocamllex lexer.mll       # generates lexer.ml

        ocamlyacc parser.mly     # generates parser.ml and parser.mli

        ocamlc -c parser.mli

        ocamlc -c lexer.ml

        ocamlc -c parser.ml

        ocamlc -c calc.ml

        ocamlc -o calc lexer.cmo parser.cmo calc.cmo
  

12.7 Common errors

ocamllex: transition table overflow, automaton is too big


The deterministic automata generated by ocamllex are limited to at most 32767 transitions. The message above indicates that your lexer definition is too complex and overflows this limit. This is commonly caused by lexer definitions that have separate rules for each of the alphabetic keywords of the language, as in the following example.
        rule token = parse

  "keyword1"   { KWD1 }

| "keyword2"   { KWD2 }

| ...

| "keyword100" { KWD100 }

| ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] *

               { IDENT(Lexing.lexeme lexbuf) }
      
To keep the generated automata small, rewrite those definitions with only one general ``identifier'' rule, followed by a hashtable lookup to separate keywords from identifiers:
        { let keyword_table = Hashtbl.create 53

  let _ =

    List.iter (fun (kwd, tok) -> Hashtbl.add keyword_table kwd tok)

              [ "keyword1", KWD1;

                "keyword2", KWD2; ...

                "keyword100", KWD100 ]

}

rule token = parse

  ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] *

               { let id = Lexing.lexeme lexbuf in

                 try

                   Hashtbl.find keyword_table s

                 with Not_found ->

                   IDENT s }
      

Lexer and parser generators (ocamllex, ocamlyacc)


更多文章、技術(shù)交流、商務合作、聯(lián)系博主

微信掃碼或搜索:z360901061

微信掃一掃加我為好友

QQ號聯(lián)系: 360901061

您的支持是博主寫作最大的動力,如果您喜歡我的文章,感覺我的文章對您有幫助,請用微信掃描下面二維碼支持博主2元、5元、10元、20元等您想捐的金額吧,狠狠點擊下面給點支持吧,站長非常感激您!手機微信長按不能支付解決辦法:請將微信支付二維碼保存到相冊,切換到微信,然后點擊微信右上角掃一掃功能,選擇支付二維碼完成支付。

【本文對您有幫助就好】

您的支持是博主寫作最大的動力,如果您喜歡我的文章,感覺我的文章對您有幫助,請用微信掃描上面二維碼支持博主2元、5元、10元、自定義金額等您想捐的金額吧,站長會非常 感謝您的哦!!!

發(fā)表我的評論
最新評論 總共0條評論
主站蜘蛛池模板: 亚洲视频精品 | 欧洲a老妇女黄大片 | 久久免费国产 | 欧美在线观看一区 | 咪咪色网 | 免费一级真人毛片 | 成人久久18免费网站游戏 | 中文字幕国产 | 欧美爱爱视频网站 | 在线观看高清国产福利视频 | 伊人三区| 精品视频在线免费看 | 成人97在线观看免费高清 | 国产99区 | 狠狠色噜噜狠狠狠狠97不卡 | 99香蕉视频| 久在线精品视频 | 免费日韩毛片 | 欧亚在线视频 | 在线亚洲精品国产波多野结衣 | 国内成人免费视频 | 四虎永久地址入口 | 四虎影院紧急入口 | 亚洲精品一区二区久久这里 | 久草在线首页 | 九九视频九九 | 中文精品久久久久国产网站 | 成人午夜久久精品 | 成在线人免费视频一区二区三区 | 日本久操视频 | 国产一级在线免费观看 | 九九热这里有精品 | 国内精品久久久久影院不卡 | 日韩成人伦理 | 色婷婷精品综合久久狠狠 | 亚洲综合资源 | 玖玖国产在线 | 天天射日日干 | 波多野结衣免费一区二区三区香蕉 | 久久青草国产免费观看 | 99热久久国产精品这里小说 |