WO2016027170A2 - Lexical analysis tool - Google Patents

Lexical analysis tool

Info

Publication number
WO2016027170A2
WO2016027170A2 (PCT/IB2015/002222)
Authority
WO
WIPO (PCT)
Prior art keywords
keywords
token
keyword
tool
pwtab
Prior art date
Application number
PCT/IB2015/002222
Other languages
French (fr)
Other versions
WO2016027170A3 (en)
Inventor
Isaiah Pinchas KANTOROVITZ
Original Assignee
Kantorovitz Isaiah Pinchas
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kantorovitz Isaiah Pinchas filed Critical Kantorovitz Isaiah Pinchas
Priority to PCT/IB2015/002222
Publication of WO2016027170A2
Publication of WO2016027170A3

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • G06F8/425Lexical analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

This paper provides an algorithm for constructing a lexical analysis tool by different means than the UNIX Lex tool. The input is a keywords table describing the target language's keywords, keysymbols, and their semantics, instead of using regular expressions to do so. The output is a lexical analyzer for the specific programming language. The tool can also be used as a translator engine, by inputting a dictionary table, and as a pattern recognizer. Keywords: Compiler, Lexical Analysis, Scanner, Algorithm, Software Tool.

Description

Lexical Analysis Tool
Abstract
This paper provides an algorithm for constructing a lexical analysis tool by different means than the UNIX Lex tool. The input is a keywords table describing the target language's keywords, keysymbols, and their semantics, instead of using regular expressions to do so.
The output is a lexical analyzer for the specific programming language. The tool can also be used as a translator engine, by inputting a dictionary table, and as a pattern recognizer.
Keywords: Compiler, Lexical Analysis, Scanner, Algorithm, Software Tool
1 Introduction
It is convenient to regard source program statements as a sequence of tokens rather than simply as a string of characters. Tokens may be thought of as the fundamental building blocks of the language. For example, a token might be a keyword, a variable name, an integer, an arithmetic operator, etc. The task of scanning the source statement, recognizing and classifying the various tokens, is known as lexical analysis. The part of the compiler that performs this analytic function is commonly called the scanner. After the token scan, each statement in the program must be recognized as some language construct, such as a declaration or an assignment statement, described by the grammar. This process, called parsing, is performed by a part of the compiler usually called the parser. (See [4] for a simple construction.)
There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing: the design is simpler, compiler efficiency is improved, and compiler portability is enhanced.
Regular expressions, a tool from mathematical logic, were soon introduced in order to specify the tokens of a given programming language. Since the theory of regular expressions is dual to that of finite state automata, both were used: the former to specify tokens, the latter to describe the process of identifying tokens.
It was quickly appreciated that tools to build lexical analyzers from regular-expression specifications would be useful in the implementation of compilers. Lex (UNIX) is an example. A lexical analyzer created by Lex behaves in concert with the parser. By changing the regular expressions input into Lex, we get different lexical analyzers for different programming languages (see [3]). Further details can be found in Chapter 3 of [1] and in [2].
This paper is about constructing a lex-like tool, but from a different approach. It can be argued that the introduction of the finite state automata and regular expression model is justified for a text like: AAAAAAB ABABA...
when we are searching for patterns like: ABB
It is not justified when we try to analyze a statement like: COVARIANCE := 50;
Here we want an analysis:
ID, ASSIGN, NUM ;
General pattern recognition might recognize the keyword "VAR" inside "COVARIANCE" which is an ID.
It can be further argued that, while it is very justified to introduce the grammar model in the parsing phase, the lexical analysis phase is only complicated by general pattern recognition models like automata (see [5] and [6]).
This paper is interested in constructing a lex-like tool that takes advantage of the uniformity, or limited scope and vocabulary, of programming languages. Thus we do not need a very general tool, but one that will tolerate minor differences between known programming languages and will let the programmer define any new language within the normal variance (which is comfortably large).
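To make the contrast concrete, the following is a minimal C sketch (not part of the tool itself) of whole-token keyword lookup against a keywords table. Because each blank-delimited token is compared as a whole, the keyword "VAR" can never be found inside the identifier "COVARIANCE"; the table contents and function names here are illustrative assumptions.

#include <stdio.h>
#include <string.h>

/* illustrative keywords table (an assumption for this sketch) */
static const char *keywords[] = { "IF", "THEN", "VAR", ":=", ";" };
static const int numkw = 5;

/* classify one blank-delimited token as a keyword or an id */
static const char *classify(const char *tok)
{
    int i;
    for (i = 0; i < numkw; i++)
        if (strcmp(tok, keywords[i]) == 0) /* whole-token comparison */
            return "keyword";
    return "id";
}

int main(void)
{
    printf("%s\n", classify("VAR"));        /* prints: keyword */
    printf("%s\n", classify("COVARIANCE")); /* prints: id */
    return 0;
}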
2 Algorithm
We begin with five preliminary steps.
1. The user of lex.cpp enters into a file named "keywords.cpp" all the reserved keywords and meaningful keysymbols of the language in the form:
keysymbol_i, blank, semantics-of-keysymbol_i (i = 1 ... n).
(An illustrative keywords file is sketched at the end of this section.)
2. The numbers and id's are considered uniformly defined for all programming languages:
NUM = DIGITS, OPTIONAL-FRACTION, OPTIONAL-EXPONENT.
ID = LETTER and (LETTER or DIGIT or NON-KEY-SYMBOL)*
Therefore they are handled inside the source program, where they can be modified. The exponent symbol is likewise defined inside the source program and can easily be modified there.
3. Comment symbols are not keywords - they are defined in the beginning of the source program and can be modified from there.
4. The program to be analyzed must be in a file named "program.cpp". It must have blanks between alphabetical tokens (as is normal practice among program writers).
Example:
IF X = "IF" is a keyword.
IFX = "IFX" is an id.
5. Non-alphabetical tokens, i.e. keysymbols, are of length at most 2.
Example: "==" and ":=" are keysymbols.
This restriction can be modified within the source program. Blanks between non-alphabetical and alphabetical tokens are optional.
Example:
x:=5+u; is equivalent to x := 5 + u ;

Data Flow:
• main() opens the "keywords" file. It reads the keywords and keysymbols (skipping comments), and inserts them into a keyword table.
• It then calls filler(). filler() is a "blank manager": it puts blanks between non-alphabetical tokens. Blanks between alphabetical tokens already exist according to step 4.
• The method is like moving the text between two buckets: the "program" and "prog" files (see [7]).
— We take 2 characters (s1 and s2) from "program" (add s3, ..., sn to expand step 5).
— If the first is non-alphabetical, we try to match s1 and s2 against the keywords table (for example "==").
— If we fail, we throw the second character, s2, back into the "program" bucket, and try to match s1 alone (for example "+").
— If we fail again, we just throw the character into the second bucket, named "prog".
— If we succeed with the match, we glue a blank on each side of the match and throw it into the second bucket, "prog".
— We exclude the dot (.) from the process, since it might be a decimal point and we do not want to separate a number with blanks around the decimal point.
— We continue till EOF.
The result is the file "prog", with blanks between all the tokens, except possibly rear punctuation marks (. ; , :), chiefly the dot. lexer() takes care of them.
• Now we call lexer(). It fills the token table simply by fetching strings up to the blanks. Suffix punctuation is separated while checking one character backwards that it is not two dots, etc.
• Now we call compar(). It compares the token table with the keyword table and gives the lexical analysis results. The method is sequential search:
— We first search the token table for keywords using the keywords table. Then we search the remaining tokens as id's to find multi-used id's.
— Then we search the remaining tokens as numbers:
— If (length as a string) = (number of digits), the token is an integer.
— If (length as a string) = (number of digits) + 1, 2 or 3, the token is a real (because of decimal points and exponent symbols).
— The remaining tokens are id's.
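As an illustration (not part of the original description), the following sketch shows what a keywords file and the resulting analysis might look like for a small Pascal-like language; the file contents and semantics codes are assumptions chosen for the example, not prescribed by the tool.

/* keywords.cpp (illustrative): keysymbol, blank, semantics-of-keysymbol */
IF    if-keyword
THEN  then-keyword
:=    assign
+     plus
;     semicolon

/* program.cpp (illustrative) */
x:=5+u;

/* after filler() the intermediate file prog.cpp reads:  x := 5 + u ;   */
/* compar() would then classify the tokens roughly as follows:          */
/*   x  -> id                                                           */
/*   := -> keyword (assign)                                             */
/*   5  -> integer (length as a string = number of digits)              */
/*   +  -> keyword (plus)                                               */
/*   u  -> id                                                           */
/*   ;  -> keyword (semicolon)                                          */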
References
[1] Aho, Sethi and Ullman, Compilers - Principles, Techniques and Tools, Addison-Wesley, Reading, Massachusetts, 1986.
[2] Seppo Sippu and Eljas Soisalon-Soininen, Parsing Theory Vol. I, Springer-Verlag, Berlin, 1988.
[3] John R. Levine, Tony Mason and Doug Brown, Unix Programming Tools - Lex and Yacc, O'Reilly and Associates Inc., California, 1992.
[4] Leland L. Beck, System Software, Addison-Wesley, Reading, Massachusetts, 1990.
[5] Allen Holub, Compiler Design in C, Prentice-Hall, Englewood Cliffs, New Jersey, 1990.
[6] J. Heering, P. Klint and J. Rekers, Incremental Generation of Lexical Scanners, ACM Transactions on Programming Languages and Systems 14(4) (1992), 490-520.
[7] W. Yang, On the Look-Ahead Problem in Lexical Analysis, Acta Informatica 32(5) (1995), 459-476.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <math.h>

#define TOKENSIZE 25 /* max size of keyword or token */
#define MAX 100      /* max number of tokens we can handle */

/* Comment and string symbols are defined here and can be modified from
   here (steps 2 and 3); the particular values below are illustrative. */
#define COPEN '{'      /* comment-open symbol */
#define CCLOSE '}'     /* comment-close symbol */
#define STRINGMARK '"' /* string delimiter */

char token[TOKENSIZE]; /* token reading buffer */

/* keyword table */
struct kw {
    char keyword[TOKENSIZE];
    int len;
    int line;
    char code[TOKENSIZE];
    int numcode;
} kwtab[MAX];

int num1;  /* stores number of entries in keyword table */
FILE *f1;  /* file descriptor for keywords file */
FILE *f2;  /* file descriptor for sample program file */
FILE *f3;  /* file descriptor for intermediate file */

int filler();
int lexer();
int compar();

int main()
/* fills keyword table */
{
    int i, s, t;

    /* open files for reading */
    f1 = fopen("keywords.cpp", "r");
    f2 = fopen("program.cpp", "r");
    f3 = fopen("interlexreslt.cpp", "w");
    i = s = 0;
    t = getc(f1);
    while (t != EOF)
    {
        if (t == '\n') i++;              /* increment line number */
        else if (t == ' ' || t == '\t'); /* skip blanks */
        else
        {
            ungetc(t, f1);
            fscanf(f1, "%s", token);
            /* insert keyword into keywords table (kwtab) */
            strcpy(kwtab[s].keyword, token);
            kwtab[s].len = strlen(token);
            kwtab[s].line = i;
            fscanf(f1, "%s", kwtab[s].code);
            s++;
        }
        t = getc(f1);
    }
    num1 = s;
    fclose(f3);
    filler();
    lexer();
    compar();
    return 0;
}
int filler()
/* puts blanks between non-alphabetical tokens of the program.
   stores the results inside file prog. */
/* assumption: length(non-alphabetical tokens) <= 2 in all languages,
   i.e. =  *  <=  ..  etc., but we do not have ***  ===  !===  etc.
   this can be modified by using s3 in addition to s1 and s2 as scanner */
{
    /* s1 and s2 store one character each. we inch through the file f2 by
       scanning and analyzing s1 and s2 each time, i.e. we scan two
       characters and compare them to the given keyword table, first for
       length 2, then for length 1. if a match is found we put them into f3
       with blanks surrounding them */
    int s1, s2, i;
    f3 = fopen("prog.cpp", "w");
    s1 = getc(f2);
    while (s1 != EOF)
    {
        /* isolate comment symbols by blanks */
        if (s1 == COPEN)
        {
            putc(' ', f3);
            putc(COPEN, f3);
            putc(' ', f3);
            goto yy;
        }
        if (s1 == CCLOSE)
        {
            putc(' ', f3);
            putc(CCLOSE, f3);
            putc(' ', f3);
            goto yy;
        }
        /* isolate string symbols by blanks */
        if (s1 == STRINGMARK)
        {
            putc(' ', f3);
            putc(STRINGMARK, f3);
            putc(' ', f3);
            while ((s1 = getc(f2)) != STRINGMARK)
            {
                if (s1 == EOF) exit(-1); /* pathological: EOF inside string */
                putc(s1, f3);
            }
            putc(' ', f3);
            putc(STRINGMARK, f3);
            putc(' ', f3);
            goto yy;
        }
        /* get the second character */
        s2 = getc(f2);
        /* switch exponent to internal representation */
        if (isdigit(s1) && (s2 == 'E' || s2 == 'e'))
        {
            putc(s1, f3);
            putc('"', f3); /* internal exponent symbol (defined per step 2) */
            s1 = getc(f2);
            if (s1 == '-') putc('"', f3);
            else if (s1 == '+'); /* skip an explicit plus sign */
            else ungetc(s1, f2);
            goto yy;
        }
        for (i = 0; i < num1; i++)
        {
            /* match two-character keysymbols; excludes alphabetical tokens */
            if ((kwtab[i].keyword[0] == s1) && (kwtab[i].keyword[1] == s2)
                && (kwtab[i].len == 2) && (isalpha(s1) == 0))
            {
                putc(' ', f3);
                putc(s1, f3);
                putc(s2, f3);
                putc(' ', f3);
                goto yy;
            }
        }
        for (i = 0; i < num1; i++)
        {
            /* match one-character keysymbols; excludes the decimal point */
            if ((kwtab[i].keyword[0] == s1) && (kwtab[i].len == 1) && (s1 != '.'))
            {
                ungetc(s2, f2);
                putc(' ', f3);
                putc(s1, f3);
                putc(' ', f3);
                goto yy;
            }
        }
        ungetc(s2, f2);
        putc(s1, f3);
yy:     s1 = getc(f2);
    }
    fclose(f3);
    return 0;
}
/* parsed-word table for lexical analysis token results */
struct pw {
    char keyword[TOKENSIZE];
    int len;
    int line;
    char code[TOKENSIZE];
    char *flag;
    int numcode;
    float tokenval;
} pwtab[MAX];

int num2; /* stores number of entries in token table */

int lexer()
/* fills token table */
{
    int i, k, t, w, s, found;
    f2 = fopen("prog.cpp", "r");
    i = t = w = s = found = 0;
    t = getc(f2);
    while (t != EOF)
    {
        if (t == '\n') i++;              /* increment line number */
        else if (t == ' ' || t == '\t'); /* skip blanks */
        else
        {
            /* use the blank-filling done to the program in order to use
               scanf to fetch tokens */
            ungetc(t, f2);
            fscanf(f2, "%s", token);
            /* skip comments */
            /* skip strings */
            /* measuring the length of the token */
            w = strlen(token);
            /* next paragraph's job: separate rear punctuation marks from
               the token, if the token is more than 1 character long */
            if (w > 1)
            {
                if ((token[w-1] == '.') && (token[w-2] != '.'))
                /* the last character is a dot and the one before it is not,
                   so this is a rear punctuation mark and not an array mark,
                   i.e. .. */
                {
                    w--;       /* make note that token length is w-1 */
                    found = 1; /* raise flag that this is the situation */
                }
            }
            /* insert token into token table, omitting any rear punctuation
               mark */
            for (k = 0; k < w; k++) pwtab[s].keyword[k] = token[k];
            pwtab[s].keyword[w] = '\0';
            pwtab[s].len = w;
            pwtab[s].line = i;
            /* all tokens are id's unless otherwise found in compar() */
            pwtab[s].flag = "id";
            s++;
            if (s >= MAX) exit(-1); /* pathological: token table overflow */
        }
        t = getc(f2);
    }
    num2 = s; /* number of entries in table is recorded */
    return 0;
}
int compar()
/* compares token table with keyword table and assigns proper code
   and status */
{
    int i, j, k, z;
    float tokenval1 = 0;
    /* check for keywords */
    for (i = 0; i < num2; i++) /* run over token table to limit of entries == num2 */
    {
        for (j = 0; j < num1; j++) /* run over keyword table to limit == num1 */
            /* if match found */
            if (strcmp(pwtab[i].keyword, kwtab[j].keyword) == 0)
            {
                for (k = 0; k <= strlen(kwtab[j].code); k++)
                    pwtab[i].code[k] = kwtab[j].code[k];
                pwtab[i].numcode = kwtab[j].numcode;
                pwtab[i].flag = "keyword"; /* i.e. this is a keyword, not an id */
                break; /* this loop, and continue to run over token table */
            }
    }
    /* check for multi-used id's in token table */
    /* check for numbers: length = #digits for integers, or up to 3 more
       because of the decimal point, sign and exponent symbol for reals;
       no alphabetic characters should appear */
    for (i = 0; i < num2; i++) /* run all over token table (num2) */
    {
        k = 0; /* counter of digits */
        z = 0; /* counter of alphabetic characters, if present */
        for (j = 0; j < pwtab[i].len; j++) /* run all over token */
        {
            if (isdigit(pwtab[i].keyword[j])) k++;
            if (isalpha(pwtab[i].keyword[j])) z++;
        }
        if ((z == 0) /* no alphabetic characters found */
            && ((strcmp(pwtab[i].flag, "id") == 0)
                || (strcmp(pwtab[i].flag, "multi-used-id") == 0)))
            /* not classified before as keyword or string */
        {
            if (k == pwtab[i].len)
            {
                pwtab[i].flag = "integer";
                pwtab[i].numcode = pwtab[i].len;
            }
            else if ((k > 0) &&
                     ((k+1 == pwtab[i].len) ||
                      (k+2 == pwtab[i].len) ||
                      (k+3 == pwtab[i].len)))
            {
                pwtab[i].flag = "real";
                pwtab[i].numcode = pwtab[i].len;
            }
        }
    }
    /* changing value of numbers from string to numbers */
    for (i = 0; i < num2; i++)
    {
        if (strcmp(pwtab[i].flag, "integer") == 0)
        {
            for (j = 0; j < pwtab[i].len; j++)
                tokenval1 = tokenval1*10 + pwtab[i].keyword[j] - '0';
            pwtab[i].tokenval = tokenval1;
            tokenval1 = 0;
        }
        /* real case in a similar manner */
    }
    /* printing lexical analysis results */
    f1 = fopen("lexreslt.cpp", "w");
    fclose(f1);
    return 0;
}
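For completeness, a hedged sketch of how the listing above might be built and run, assuming it is saved as lex.c and that keywords.cpp and program.cpp are present in the working directory (the compiler invocation is illustrative):

cc -o lextool lex.c
./lextool
(the blank-filled program appears in prog.cpp; analysis results go to lexreslt.cpp)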

Claims

1. Lexical Analysis Tool
PCT/IB2015/002222 2015-12-04 2015-12-04 Lexical analysis tool WO2016027170A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2015/002222 WO2016027170A2 (en) 2015-12-04 2015-12-04 Lexical analysis tool

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2015/002222 WO2016027170A2 (en) 2015-12-04 2015-12-04 Lexical analysis tool

Publications (2)

Publication Number Publication Date
WO2016027170A2 true WO2016027170A2 (en) 2016-02-25
WO2016027170A3 WO2016027170A3 (en) 2016-05-12

Family

ID=55351346

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2015/002222 WO2016027170A2 (en) 2015-12-04 2015-12-04 Lexical analysis tool

Country Status (1)

Country Link
WO (1) WO2016027170A2 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997007452A1 (en) * 1995-08-15 1997-02-27 International Software Machines Programmable compiler
CN103999081A (en) * 2011-12-12 2014-08-20 国际商业机器公司 Generation of natural language processing model for information domain

Also Published As

Publication number Publication date
WO2016027170A3 (en) 2016-05-12

Similar Documents

Publication Publication Date Title
Owens et al. Regular-expression derivatives re-examined
Täckström et al. Efficient inference and structured learning for semantic role labeling
US6529865B1 (en) System and method to compile instructions to manipulate linguistic structures into separate functions
Levine Flex & Bison: Text Processing Tools
US6928448B1 (en) System and method to match linguistic structures using thesaurus information
Rahman et al. Natural software revisited
Dean et al. Agile parsing in TXL
CN106843840B (en) Source code version evolution annotation multiplexing method based on similarity analysis
US7676358B2 (en) System and method for the recognition of organic chemical names in text documents
Van Cranenburgh et al. Data-oriented parsing with discontinuous constituents and function tags
US7779049B1 (en) Source level optimization of regular expressions
Lindén et al. Hfst—a system for creating nlp tools
US5949993A (en) Method for the generation of ISA simulators and assemblers from a machine description
CN112699665A (en) Triple extraction method and device of safety report text and electronic equipment
Van Cranenburgh et al. Discontinuous parsing with an efficient and accurate DOP model
Zhong et al. Semantic scaffolds for pseudocode-to-code generation
Kumar et al. Sanskrit compound processor
US20080141230A1 (en) Scope-Constrained Specification Of Features In A Programming Language
Koskenniemi Finite state morphology and information retrieval
Iwama et al. Constructing parser for industrial software specifications containing formal and natural language description
Paakki Prolog in practical compiler writing
Kantorovitz Lexical analysis tool
Mössenböck Alex—a simple and efficient scanner generator
US20220004708A1 (en) Methods and apparatus to improve disambiguation and interpretation in automated text analysis using structured language space and transducers applied on automatons
Jain et al. Cascaded finite-state chunk parsing for Hindi language

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15834464

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15834464

Country of ref document: EP

Kind code of ref document: A2

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23/10/2018)