WO2016027170A2 - Lexical analysis tool - Google Patents

Lexical analysis tool

Info

Publication number
WO2016027170A2
WO2016027170A2 (PCT/IB2015/002222)
Authority
WO
WIPO (PCT)
Prior art keywords
keywords
token
keyword
tool
pwtab
Prior art date
Application number
PCT/IB2015/002222
Other languages
French (fr)
Other versions
WO2016027170A3 (en)
Inventor
Isaiah Pinchas KANTOROVITZ
Original Assignee
Kantorovitz Isaiah Pinchas
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kantorovitz Isaiah Pinchas filed Critical Kantorovitz Isaiah Pinchas
Priority to PCT/IB2015/002222
Publication of WO2016027170A2
Publication of WO2016027170A3

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • G06F8/425Lexical analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

This paper provides an algorithm for constructing a lexical analysis tool by different means than the UNIX Lex tool. The input is a keywords table describing the target language's keywords, keysymbols, and their semantics, instead of using regular expressions to do so. The output is a lexical analyzer for the specific programming language. The tool can also be used as a translator engine, by inputting a dictionary table, and as a pattern recognizer. Keywords: Compiler, Lexical Analysis, Scanner, Algorithm, Software Tool.

Description

Lexical Analysis Tool
Abstract
This paper provides an algorithm for constructing a lexical analysis tool by different means than the UNIX Lex tool. The input is a keywords table describing the target language's keywords, keysymbols, and their semantics, instead of using regular expressions to do so.
The output is a lexical analyzer for the specific programming language. The tool can also be used as a translator engine, by inputting a dictionary table, and as a pattern recognizer.
Keywords: Compiler, Lexical Analysis, Scanner, Algorithm, Software Tool
1 Introduction
It is convenient to regard source program statements as a sequence of tokens rather than simply as a string of characters. Tokens may be thought of as the fundamental building blocks of the language. For example, a token might be a keyword, a variable name, an integer, an arithmetic operator, etc. The task of scanning the source statement, recognizing and classifying the various tokens, is known as lexical analysis. The part of the compiler that performs this analytic function is commonly called the scanner. After the token scan, each statement in the program must be recognized as some language construct, such as a declaration or an assignment statement, described by the grammar. This process, called parsing, is performed by a part of the compiler usually called the parser. (See [4] for a simple construction.)
There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing: the design is simpler, compiler efficiency is improved, and compiler portability is enhanced.
Regular expressions, a tool from mathematical logic, were soon introduced in order to specify the tokens of a given programming language. Since the theory of regular expressions is dual to that of finite state automata, both were used: the former to specify tokens, the latter to describe the process of identifying tokens.
It was quickly appreciated that tools to build lexical analyzers from regular-expression specifications would be useful in the implementation of compilers. Lex (UNIX) is an example. A lexical analyzer created by Lex behaves in concert with the parser. By changing the regular expressions input into Lex, we get different lexical analyzers for different programming languages (see [3]). Further details can be found in Chapter 3 of [1] and in [2].
This paper is about constructing a lex-like tool, but from a different approach. It can be argued that the introduction of the finite state automata and regular expression model is justified for a text like: AAAAAAB ABABA...
when we are searching for patterns like: ABB
It is not justified when we try to analyze a statement like: COVARIANCE := 50;
Here we want an analysis:
ID, ASSIGN, NUM ;
General pattern recognition might recognize the keyword "VAR" inside "COVARIANCE" which is an ID.
It can be further argued that, while it is very justified to introduce the grammar model in the parsing phase, the lexical analysis phase is only complicated by general pattern recognition models like automata (see [5] and [6]).
This paper is interested in constructing a lex-like tool that takes advantage of the uniformity, or limited scope and vocabulary, of programming languages. Thus we do not need a very general tool, but one that will tolerate minor differences between known programming languages and will let the programmer define any new language within the normal variance (which is comfortably large).
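To make the contrast concrete, the following is a minimal C sketch (not part of the tool itself) of whole-token keyword lookup against a keywords table. Because each blank-delimited token is compared as a whole, the keyword "VAR" can never be found inside the identifier "COVARIANCE"; the table contents and function names here are illustrative assumptions.

#include <stdio.h>
#include <string.h>

/* illustrative keywords table (an assumption for this sketch) */
static const char *keywords[] = { "IF", "THEN", "VAR", ":=", ";" };
static const int numkw = 5;

/* classify one blank-delimited token as a keyword or an id */
static const char *classify(const char *tok)
{
    int i;
    for (i = 0; i < numkw; i++)
        if (strcmp(tok, keywords[i]) == 0) /* whole-token comparison */
            return "keyword";
    return "id";
}

int main(void)
{
    printf("%s\n", classify("VAR"));        /* prints: keyword */
    printf("%s\n", classify("COVARIANCE")); /* prints: id */
    return 0;
}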
2 Algorithm
We begin with five preliminary steps.
1. The user of lex.cpp enters into a file named "keywords.cpp" all the reserved keywords and meaningful keysymbols of the language in the form:
keysymbol_i, blank, semantics-of-keysymbol_i (i = 1 ... n).
(An illustrative keywords file is sketched at the end of this section.)
2. The numbers and id's are considered uniformly defined for all programming languages:
NUM = DIGITS, OPTIONAL-FRACTION, OPTIONAL-EXPONENT.
ID = LETTER and (LETTER or DIGIT or NON-KEY-SYMBOL)*
Therefore they are handled inside the source program, where they can be modified. The exponent symbol is likewise defined inside the source program and can easily be modified there.
3. Comment symbols are not keywords - they are defined in the beginning of the source program and can be modified from there.
4. The program to be analyzed must be in a file named "program.cpp". It must have blanks between alphabetical tokens (as is normal practice among program writers).
Example:
IF X = "IF" is a keyword.
IFX = "IFX" is an id.
5. Non-alphabetical tokens, i.e. keysymbols, are of length at most 2.
Example: "==" and ":=" are keysymbols.
This restriction can be modified within the source program. Blanks between non-alphabetical and alphabetical tokens are optional.
Example:
x:=5+u; is equivalent to x := 5 + u ;

Data Flow:
• main() opens the "keywords" file. It reads the keywords and keysymbols (skipping comments), and inserts them into a keyword table.
• It then calls filler(). filler() is a "blank manager": it puts blanks between non-alphabetical tokens. Blanks between alphabetical tokens already exist according to step 4.
• The method is like moving the text between two buckets: the "program" and "prog" files (see [7]).
— We take 2 characters (s1 and s2) from "program" (add s3, ..., sn to expand step 5).
— If the first is non-alphabetical, we try to match s1 and s2 against the keywords table (for example "==").
— If we fail, we throw the second character, s2, back into the "program" bucket, and try to match s1 alone (for example "+").
— If we fail again, we just throw the character into the second bucket, named "prog".
— If we succeed with the match, we glue a blank on each side of the match and throw it into the second bucket, "prog".
— We exclude the dot (.) from the process, since it might be a decimal point and we do not want to separate a number with blanks around the decimal point.
— We continue till EOF.
The result is the file "prog", with blanks between all the tokens, except possibly rear punctuation marks (. ; , :), chiefly the dot. lexer() takes care of them.
• Now we call lexer(). It fills the token table simply by fetching strings up to the blanks. Suffix punctuation is separated while checking one character backwards that it is not two dots, etc.
• Now we call compar(). It compares the token table with the keyword table and gives the lexical analysis results. The method is sequential search:
— We first search the token table for keywords using the keywords table. Then we search the remaining tokens as id's to find multi-used id's.
— Then we search the remaining tokens as numbers:
— If (length as a string) = (number of digits), the token is an integer.
— If (length as a string) = (number of digits) + 1, 2 or 3, the token is a real (because of decimal points and exponent symbols).
— The remaining tokens are id's.
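As an illustration (not part of the original description), the following sketch shows what a keywords file and the resulting analysis might look like for a small Pascal-like language; the file contents and semantics codes are assumptions chosen for the example, not prescribed by the tool.

/* keywords.cpp (illustrative): keysymbol, blank, semantics-of-keysymbol */
IF    if-keyword
THEN  then-keyword
:=    assign
+     plus
;     semicolon

/* program.cpp (illustrative) */
x:=5+u;

/* after filler() the intermediate file prog.cpp reads:  x := 5 + u ;   */
/* compar() would then classify the tokens roughly as follows:          */
/*   x  -> id                                                           */
/*   := -> keyword (assign)                                             */
/*   5  -> integer (length as a string = number of digits)              */
/*   +  -> keyword (plus)                                               */
/*   u  -> id                                                           */
/*   ;  -> keyword (semicolon)                                          */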
References
[1] Aho, Sethi and Ullman, Compilers - Principles, Techniques and Tools, Addison-Wesley, Reading, Massachusetts, 1986.
[2] Seppo Sippu and Eljas Soisalon-Soininen, Parsing Theory Vol. I, Springer-Verlag, Berlin, 1988.
[3] John R. Levine, Tony Mason and Doug Brown, Unix Programming Tools - Lex and Yacc, O'Reilly and Associates Inc., California, 1992.
[4] Leland L. Beck, System Software, Addison-Wesley, Reading, Massachusetts, 1990.
[5] Allen Holub, Compiler Design in C, Prentice-Hall, Englewood Cliffs, New Jersey, 1990.
[6] J. Heering, P. Klint and J. Rekers, Incremental Generation of Lexical Scanners, ACM Transactions on Programming Languages and Systems 14(4) (1992), 490-520.
[7] W. Yang, On the Look-Ahead Problem in Lexical Analysis, Acta Informatica 32(5) (1995), 459-476.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <math.h>

#define TOKENSIZE 25 /* max size of keyword or token */
#define MAX 100      /* max number of tokens we can handle */

/* Comment and string symbols are defined here and can be modified from
   here (steps 2 and 3); the particular values below are illustrative. */
#define COPEN '{'      /* comment-open symbol */
#define CCLOSE '}'     /* comment-close symbol */
#define STRINGMARK '"' /* string delimiter */

char token[TOKENSIZE]; /* token reading buffer */

/* keyword table */
struct kw {
    char keyword[TOKENSIZE];
    int len;
    int line;
    char code[TOKENSIZE];
    int numcode;
} kwtab[MAX];

int num1;  /* stores number of entries in keyword table */
FILE *f1;  /* file descriptor for keywords file */
FILE *f2;  /* file descriptor for sample program file */
FILE *f3;  /* file descriptor for intermediate file */

int filler();
int lexer();
int compar();

int main()
/* fills keyword table */
{
    int i, s, t;

    /* open files for reading */
    f1 = fopen("keywords.cpp", "r");
    f2 = fopen("program.cpp", "r");
    f3 = fopen("interlexreslt.cpp", "w");
    i = s = 0;
    t = getc(f1);
    while (t != EOF)
    {
        if (t == '\n') i++;              /* increment line number */
        else if (t == ' ' || t == '\t'); /* skip blanks */
        else
        {
            ungetc(t, f1);
            fscanf(f1, "%s", token);
            /* insert keyword into keywords table (kwtab) */
            strcpy(kwtab[s].keyword, token);
            kwtab[s].len = strlen(token);
            kwtab[s].line = i;
            fscanf(f1, "%s", kwtab[s].code);
            s++;
        }
        t = getc(f1);
    }
    num1 = s;
    fclose(f3);
    filler();
    lexer();
    compar();
    return 0;
}
int filler()
/* puts blanks between non-alphabetical tokens of the program.
   stores the results inside file prog. */
/* assumption: length(non-alphabetical tokens) <= 2 in all languages,
   i.e. =  *  <=  ..  etc., but we do not have ***  ===  !===  etc.
   this can be modified by using s3 in addition to s1 and s2 as scanner */
{
    /* s1 and s2 store one character each. we inch through the file f2 by
       scanning and analyzing s1 and s2 each time, i.e. we scan two
       characters and compare them to the given keyword table, first for
       length 2, then for length 1. if a match is found we put them into f3
       with blanks surrounding them */
    int s1, s2, i;
    f3 = fopen("prog.cpp", "w");
    s1 = getc(f2);
    while (s1 != EOF)
    {
        /* isolate comment symbols by blanks */
        if (s1 == COPEN)
        {
            putc(' ', f3);
            putc(COPEN, f3);
            putc(' ', f3);
            goto yy;
        }
        if (s1 == CCLOSE)
        {
            putc(' ', f3);
            putc(CCLOSE, f3);
            putc(' ', f3);
            goto yy;
        }
        /* isolate string symbols by blanks */
        if (s1 == STRINGMARK)
        {
            putc(' ', f3);
            putc(STRINGMARK, f3);
            putc(' ', f3);
            while ((s1 = getc(f2)) != STRINGMARK)
            {
                if (s1 == EOF) exit(-1); /* pathological: EOF inside string */
                putc(s1, f3);
            }
            putc(' ', f3);
            putc(STRINGMARK, f3);
            putc(' ', f3);
            goto yy;
        }
        /* get the second character */
        s2 = getc(f2);
        /* switch exponent to internal representation */
        if (isdigit(s1) && (s2 == 'E' || s2 == 'e'))
        {
            putc(s1, f3);
            putc('"', f3); /* internal exponent symbol (defined per step 2) */
            s1 = getc(f2);
            if (s1 == '-') putc('"', f3);
            else if (s1 == '+'); /* skip an explicit plus sign */
            else ungetc(s1, f2);
            goto yy;
        }
        for (i = 0; i < num1; i++)
        {
            /* match two-character keysymbols; excludes alphabetical tokens */
            if ((kwtab[i].keyword[0] == s1) && (kwtab[i].keyword[1] == s2)
                && (kwtab[i].len == 2) && (isalpha(s1) == 0))
            {
                putc(' ', f3);
                putc(s1, f3);
                putc(s2, f3);
                putc(' ', f3);
                goto yy;
            }
        }
        for (i = 0; i < num1; i++)
        {
            /* match one-character keysymbols; excludes the decimal point */
            if ((kwtab[i].keyword[0] == s1) && (kwtab[i].len == 1) && (s1 != '.'))
            {
                ungetc(s2, f2);
                putc(' ', f3);
                putc(s1, f3);
                putc(' ', f3);
                goto yy;
            }
        }
        ungetc(s2, f2);
        putc(s1, f3);
yy:     s1 = getc(f2);
    }
    fclose(f3);
    return 0;
}
/* parsed-word table for lexical analysis token results */
struct pw {
    char keyword[TOKENSIZE];
    int len;
    int line;
    char code[TOKENSIZE];
    char *flag;
    int numcode;
    float tokenval;
} pwtab[MAX];

int num2; /* stores number of entries in token table */

int lexer()
/* fills token table */
{
    int i, k, t, w, s, found;
    f2 = fopen("prog.cpp", "r");
    i = t = w = s = found = 0;
    t = getc(f2);
    while (t != EOF)
    {
        if (t == '\n') i++;              /* increment line number */
        else if (t == ' ' || t == '\t'); /* skip blanks */
        else
        {
            /* use the blank-filling done to the program in order to use
               scanf to fetch tokens */
            ungetc(t, f2);
            fscanf(f2, "%s", token);
            /* skip comments */
            /* skip strings */
            /* measuring the length of the token */
            w = strlen(token);
            /* next paragraph's job: separate rear punctuation marks from
               the token, if the token is more than 1 character long */
            if (w > 1)
            {
                if ((token[w-1] == '.') && (token[w-2] != '.'))
                /* the last character is a dot and the one before it is not,
                   so this is a rear punctuation mark and not an array mark,
                   i.e. .. */
                {
                    w--;       /* make note that token length is w-1 */
                    found = 1; /* raise flag that this is the situation */
                }
            }
            /* insert token into token table, omitting any rear punctuation
               mark */
            for (k = 0; k < w; k++) pwtab[s].keyword[k] = token[k];
            pwtab[s].keyword[w] = '\0';
            pwtab[s].len = w;
            pwtab[s].line = i;
            /* all tokens are id's unless otherwise found in compar() */
            pwtab[s].flag = "id";
            s++;
            if (s >= MAX) exit(-1); /* pathological: token table overflow */
        }
        t = getc(f2);
    }
    num2 = s; /* number of entries in table is recorded */
    return 0;
}
int compar()
/* compares token table with keyword table and assigns proper code
   and status */
{
    int i, j, k, z;
    float tokenval1 = 0;
    /* check for keywords */
    for (i = 0; i < num2; i++) /* run over token table to limit of entries == num2 */
    {
        for (j = 0; j < num1; j++) /* run over keyword table to limit == num1 */
            /* if match found */
            if (strcmp(pwtab[i].keyword, kwtab[j].keyword) == 0)
            {
                for (k = 0; k <= strlen(kwtab[j].code); k++)
                    pwtab[i].code[k] = kwtab[j].code[k];
                pwtab[i].numcode = kwtab[j].numcode;
                pwtab[i].flag = "keyword"; /* i.e. this is a keyword, not an id */
                break; /* this loop, and continue to run over token table */
            }
    }
    /* check for multi-used id's in token table */
    /* check for numbers: length = #digits for integers, or up to 3 more
       because of the decimal point, sign and exponent symbol for reals;
       no alphabetic characters should appear */
    for (i = 0; i < num2; i++) /* run all over token table (num2) */
    {
        k = 0; /* counter of digits */
        z = 0; /* counter of alphabetic characters, if present */
        for (j = 0; j < pwtab[i].len; j++) /* run all over token */
        {
            if (isdigit(pwtab[i].keyword[j])) k++;
            if (isalpha(pwtab[i].keyword[j])) z++;
        }
        if ((z == 0) /* no alphabetic characters found */
            && ((strcmp(pwtab[i].flag, "id") == 0)
                || (strcmp(pwtab[i].flag, "multi-used-id") == 0)))
            /* not classified before as keyword or string */
        {
            if (k == pwtab[i].len)
            {
                pwtab[i].flag = "integer";
                pwtab[i].numcode = pwtab[i].len;
            }
            else if ((k > 0) &&
                     ((k+1 == pwtab[i].len) ||
                      (k+2 == pwtab[i].len) ||
                      (k+3 == pwtab[i].len)))
            {
                pwtab[i].flag = "real";
                pwtab[i].numcode = pwtab[i].len;
            }
        }
    }
    /* changing value of numbers from string to numbers */
    for (i = 0; i < num2; i++)
    {
        if (strcmp(pwtab[i].flag, "integer") == 0)
        {
            for (j = 0; j < pwtab[i].len; j++)
                tokenval1 = tokenval1*10 + pwtab[i].keyword[j] - '0';
            pwtab[i].tokenval = tokenval1;
            tokenval1 = 0;
        }
        /* real case in a similar manner */
    }
    /* printing lexical analysis results */
    f1 = fopen("lexreslt.cpp", "w");
    fclose(f1);
    return 0;
}
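For completeness, a hedged sketch of how the listing above might be built and run, assuming it is saved as lex.c and that keywords.cpp and program.cpp are present in the working directory (the compiler invocation is illustrative):

cc -o lextool lex.c
./lextool
(the blank-filled program appears in prog.cpp; analysis results go to lexreslt.cpp)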

Claims

1. Lexical Analysis Tool
PCT/IB2015/002222 2015-12-04 2015-12-04 Lexical analysis tool WO2016027170A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2015/002222 WO2016027170A2 (en) 2015-12-04 2015-12-04 Lexical analysis tool

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2015/002222 WO2016027170A2 (en) 2015-12-04 2015-12-04 Lexical analysis tool

Publications (2)

Publication Number Publication Date
WO2016027170A2 true WO2016027170A2 (en) 2016-02-25
WO2016027170A3 WO2016027170A3 (en) 2016-05-12

Family

ID=55351346

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2015/002222 WO2016027170A2 (en) 2015-12-04 2015-12-04 Lexical analysis tool

Country Status (1)

Country Link
WO (1) WO2016027170A2 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997007452A1 (en) * 1995-08-15 1997-02-27 International Software Machines Programmable compiler
CN103999081A (en) * 2011-12-12 2014-08-20 国际商业机器公司 Generation of natural language processing model for information domain

Also Published As

Publication number Publication date
WO2016027170A3 (en) 2016-05-12

Similar Documents

Publication Publication Date Title
Owens et al. Regular-expression derivatives re-examined
Täckström et al. Efficient inference and structured learning for semantic role labeling
US6529865B1 (en) System and method to compile instructions to manipulate linguistic structures into separate functions
Levine Flex & Bison: Text Processing Tools
US6928448B1 (en) System and method to match linguistic structures using thesaurus information
Rahman et al. Natural software revisited
Dean et al. Agile parsing in TXL
CN106843840B (en) Source code version evolution annotation multiplexing method based on similarity analysis
US7676358B2 (en) System and method for the recognition of organic chemical names in text documents
Van Cranenburgh et al. Data-oriented parsing with discontinuous constituents and function tags
US7779049B1 (en) Source level optimization of regular expressions
Lindén et al. Hfst—a system for creating nlp tools
US5949993A (en) Method for the generation of ISA simulators and assemblers from a machine description
CN112699665A (en) Triple extraction method and device of safety report text and electronic equipment
Van Cranenburgh et al. Discontinuous parsing with an efficient and accurate DOP model
Zhong et al. Semantic scaffolds for pseudocode-to-code generation
Kumar et al. Sanskrit compound processor
US20080141230A1 (en) Scope-Constrained Specification Of Features In A Programming Language
Koskenniemi Finite state morphology and information retrieval
Iwama et al. Constructing parser for industrial software specifications containing formal and natural language description
Paakki Prolog in practical compiler writing
Kantorovitz Lexical analysis tool
Mössenböck Alex—a simple and efficient scanner generator
US20220004708A1 (en) Methods and apparatus to improve disambiguation and interpretation in automated text analysis using structured language space and transducers applied on automatons
Jain et al. Cascaded finite-state chunk parsing for Hindi language

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15834464

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15834464

Country of ref document: EP

Kind code of ref document: A2

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23/10/2018)