Lecture 7: Semantic Analysis (Types and Symbol Tables)
Outline
- Type Checking Basics
- Type systems
- Type expressions
- Type system implementation
- Scope in type checking
- Symbol Tables
- Implementing symbol tables
- Use of symbol tables
- Types and Declarations
- Type Checking, Conversion, and Inference SDT's
- Type checking
- Type conversion
- Type inference
- Symbol Tables for Program Analysis and Transformation
- Summary of Lecture 7
Type checking in compilers
Non-context-free syntax
- Programs that are correct with respect to the language's lexical and context-free syntactic rules may still contain other syntactic errors.
- Lexical analysis and context-free syntax analysis are not powerful enough to ensure the correct usage of variables, objects, functions, statements, etc.
- Non-context-free syntactic analysis is known as semantic analysis.
Semantic analysis
- The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.
- Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
- The information is collected incrementally by the analysis phases of a compiler and used by the synthesis phases to generate the target code.
- An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands.
- We shall study various types of symbol tables and SDT's for type checking in this lecture.
Incorrect programs
- Example 1: lexical analysis does not distinguish between different variable or function identifiers (it returns the same token for all identifiers):
Correct | Incorrect |
---|---|
int a; a=1; |
int a; b=1; |
- Example 2: syntax analysis does not correlate the declarations with the uses of variables in the program:
Correct | Incorrect |
---|---|
int a; a=1; |
a=1; |
- Example 3: syntax analysis does not correlate the types from the declarations with the uses of variables:
Correct | Incorrect |
---|---|
int a; a=1; |
int a; a=1.0; |
Goals of semantic analysis
- Semantic analysis ensures that the program satisfies a set of additional rules regarding the usage of programming constructs (variables, objects, expressions, statements).
- Examples of semantic rules:
- Variables must be declared before being used.
- A variable should not be declared multiple times in the same scope.
- In an assignment statement, the variable and the assigned expression must have the same type.
- The condition of an
if-statement
must have type Boolean.
- Categories of rules:
- Semantic rules regarding types.
- Semantic rules regarding scopes.
1. Type Checking Basics
Type systems
What are types?
- Types describe the values possibly computed during execution of the program.
- Types are predicates on values.
- e.g., "int x" in Java means x ∈ [-2^31, 2^31-1].
- Think "type = set of possible values".
- Type errors: improper, type-inconsistent operations during program execution.
- Type-safety: absence of type errors at run time.
Type information
Type information classifies a program's constructs (e.g., variables, statements, expressions, functions) into categories, and imposes rules on their use (in terms of those categories) with the goal of avoiding runtime errors.
Construct | Example | Type |
---|---|---|
variables | int a; |
integer location |
expressions | (a+1)==2 |
Boolean |
statements | a=1.0; |
void |
functions | int pow(int n, int m) |
int × int → int |
How to ensure type-safety?
- Bind (assign) types, then check types:
- Type binding: defines types for constructs in the program (e.g., variables, functions).
- Can be either explicit (
int x
) or implicit (x=1
). - Type consistency (safety) = correctness with respect to the type bindings.
- Can be either explicit (
- Type checking: static semantic checks to enforce the type safety of the program.
- Enforce a set of type-checking rules.
- Type binding: defines types for constructs in the program (e.g., variables, functions).
Static vs. dynamic typing
- Static and dynamic typing refer to type definitions (i.e., bindings of types to variables, expressions, etc.).
- Statically typed language: types are defined and checked at compile-time, and do not change during the execution of the program.
- E.g., Pascal, C, Java.
- Dynamically typed language: types defined and checked at run-time, during program execution.
- E.g., Lisp, Scheme, Smalltalk, Python.
Strong vs. weak typing
- Strong and weak typing refer to how much type consistency is enforced.
- Strongly typed languages: guarantees that accepted programs are type-safe: Java.
- Weakly typed languages: allow programs that contain type errors: C.
- Can achieve strong typing using either static or dynamic typing.
Why static checking?
- Efficient code: Dynamic checks slow down the program execution.
- Guarantees that all executions will be safe: With dynamic checking, you never know when the next execution of the program will fail due to a type error.
- But is conservative for sound systems; needs to be expressive: Reject few type-safe programs.
Type systems concept summary
- Static vs. dynamic typing: when to define/check types?
- Strong vs. weak typing: how many type errors?
- Sound type systems: statically catch all type errors (and possibly reject some programs that have no type errors).
Strong Typing | Weak Typing | |
---|---|---|
Static Typing | ML, Pascal, Java, Modula-3 | C, C++ |
Dynamic Typing | Scheme, Python, Smalltalk, PostScript | assembly code |
Type expressions
Recap: Type system
- Type is a predicate on value.
- Type expressions: describe the possible types in the program: e.g., int, string, array[], class, etc.
- Type system: defines types for language constructs (e.g., expressions, statements, ...).
Type expression
- Languages have basic types (a.k.a. primitive types or ground types).
- E.g., int, char, boolean.
- Build type expressions using basic types with:
- Type constructors
- Type aliases
Array type
- Various kinds of array types in different programming languages.
array(T)
: array with elements of type T and no bounds.- C, Java:
int[]
, Modula-3:array of integer
.
- C, Java:
array(T, S)
: array with size.- C:
int[10]
, Modula-3:array[10] of integer
. - May be indexed 0..size-1.
- C:
array(T, L, U)
: array with upper/lower bounds.- Pascal or Ada:
array[2..5] of integer
.
- Pascal or Ada:
array(T, S1, ..., Sn)
: multi-dimensional arrays.- FORTRAN:
real(3,5)
.
- FORTRAN:
Record type
- A record is
{id1: T1, ..., idn: Tn}
for some identifiersidi
and typesTi
. - Supports access operations on each field, with corresponding type.
- C:
struct { int a; float b; }
- Pascal:
record a: integer; b: real; end
- Objects: generalize the notion of records.
Pointers
- Pointer types characterize values that are addresses of variables of other types.
Pointer(T)
: pointer to an object of type T.- C pointers:
T*
(e.g.,int *x;
). - Pascal pointers:
^T
(e.g.,x: ^integer;
). - Java: Object references.
Function Types
- A function value can be invoked with some argument expressions with types
Ti
, returns return typeTr
. - Type:
T1 × T2 × ... × Tn → Tr
. - C/C++ functions:
int pow(int x, int y)
has typeint × int → int
. - Java: methods have function types.
- Some languages have first-class functions (usually in functional languages, e.g., ML, LISP).
- C and C++ have function pointers.
- Java does not.
Type alias
- Some languages allow type aliases (type definitions, equates):
- C:
typedef int int_array[];
- Modula-3:
type int_array = array of int;
- C:
- Java does not allow type aliases.
- Aliases are not type constructors!
int_array
is the same type asint[]
. - Different type expressions may denote the same type.
Type system implementation
Implementation approaches
- Use a separate class hierarchy for type ASTs:
- Translate type expressions to type objects during parsing:
- Bind names to type objects in the symbol table during subsequent AST traversal.
Scope in type checking
Processing type declaration
- Type declarations add new identifiers and their types in the symbol tables.
- Class definitions must be added to symbol table:
class_defn ::= CLASS ID:id { decls:d }
- Forward references require multiple passes over AST to collect legal names.
- Example:
class A { B b; } class B { ... }
- Example:
- Need to handle scope information.
Scope information
- Scope information characterizes the declaration of identifiers and the portions of the program where use of each identifier is allowed.
- Example identifiers: variables, functions, objects, labels.
- Lexical scope is a textual region in the program:
- Statement block
- Formal argument list
- Object body
- Function or method body
- Module body
- Whole program (multiple modules)
- Scope of an identifier: the lexical scope in which it is valid.
Scope examples in C:
- Scope of variables in statement blocks:
Other scope examples:
- Function arguments:
int factorial(int n) { /* scope of n is the function body */ }
- Labels:
void f() { goto I; ... I: a=1; /* scope of I is the function body */ }
- Object fields and methods:
Semantic rules for scopes
-
Main rules regarding scopes:
- Rule 1: Use an identifier only if it is defined in an enclosing scope.
- Rule 2: Do not declare identifiers of the same kind with identical names more than once in the same scope.
-
You can declare identifiers with the same name if they are of a different kind
2. Symbol Tables
- Semantic checks refer to properties of identifiers in the program - their scope or type.
- Need an environment to store the information about identifiers - Symbol Table (or Symtab).
- Each entry in the symbol table contains:
- The name of an identifier.
- Additional information: its kind, its type, its size, if it is constant, ...
Example Symbol Table:
NAME | KIND | TYPE | OTHER |
---|---|---|---|
foo | fun | int x int → bool | extern |
m | par | int | auto |
n | par | int | const |
tmp | var | bool | const |
Handling scope information in symbol table
- How to represent scope information in the symbol table?
- There is a hierarchy of scopes in the program.
- Use a similar hierarchy of symbol tables.
- One symbol table for each scope.
- Each symbol table contains the symbols declared in that lexical scope.
Example
Handling identifiers with the same name
- The hierarchical structure of symbol tables automatically solves the problem of resolving name collisions.
- To find the declaration of an identifier that is active at a program point:
- Start from the current scope.
- Go up in the hierarchy until you find an identifier with the same name, or fail.
Example:
Catching semantic errors
Symbol table operations
- Three main operations:
- Create a new empty symbol table with a given parent table.
- Insert a new identifier in a symbol table (or error on re-declaration).
- Look-up an identifier in a symbol table (or error if not found).
- Cannot build symbol tables during lexical analysis because the hierarchy of scopes is encoded in the syntax.
- Build the symbol tables either:
- While parsing, using semantic actions.
- After the AST is constructed.
Implementing symbol tables
Implementation methods
- Array implementation
- List implementation
- Hash-table implementation
-
Array implementation:
-
Simple implementation = array
- one entry per symbol.
- Scan the array for loopup, compare name at each entry.
-
Disadvantage:
- Table has fixed size.
- Need to know in advance the number of entries
-
-
List implementation:
- Dynamic structure = list
- One cell per entry in the table.
- Can grow dynamically during compilation.
- Disadvantage:
- Inefficient for large Symbol tables
- Need to scan half the list on average
- Dynamic structure = list
-
Hash-table implementation:
- Efficient implementation = hash table
- It is an array of lists (buckets)
- Uses a hashing function to map the symbol name to the corresponding bucket:
hashfunc : string → int
.
- Good hash function = even distribution in the buckets.
- We shall implement scopes by setting up a separate symbol table for each scope.
- Efficient implementation = hash table
Symbol table per scope (Chained symbol tables)
- A chain of symbol tables can represent nested scopes. A table for an inner scope holds a reference to the table for its parent (outer) scope.
- Lookup proceeds from the current scope's table up through the chain of parent tables.
Java implementation for Chained Symbol Tables:
// package symbols;
import java.util.*;
public class Env {
private Hashtable table;
protected Env prev;
public Env (Env p) {
table = new Hashtable();
prev = p;
}
public void put (String s, Symbol sym) {
table.put(s, sym);
}
public Symbol get(String s) {
for(Env e = this; e != null; e = e.prev ) {
Symbol found = (Symbol)(e.table.get(s));
if(found != null) return found;
}
return null;
}
}
Forward references
- A forward reference is the use of an identifier within the scope of its declaration, but before the declaration itself appears in the text.
- Any compiler phase that uses information from the symbol table must be performed after the table is fully constructed.
- This means you cannot type-check and build the symbol table at the same time if forward references are allowed.
- Example requiring 2 passes:
Use of symbol tables
SDT for translating with symbol tables An SDT can manage the creation and destruction of symbol tables for nested blocks.
- When entering a block (
{
), create a newEnv
and link it to the previous one. - When exiting a block (
}
), restore the previousEnv
. - For declarations (
type id;
), add the symbol to the currenttop
environment. - For uses (
id
), look up the symbol starting from thetop
environment.
Example Input:
Example Output (with types annotated):
In a recursive-descent parser, the saved
environment would be a local variable in the procedure for block
.
3. Types and Declarations
An SDT can process type declarations and compute type information (like type expressions and memory width) to store in the symbol table.
Grammar:
- D → T id; D | ε
- T → B C | record { D }
- B → int | float
- C → ε | [num] C
SDT with Semantic Rules:
Attributes type
and width
are synthesized up the parse tree.
- For
B
: Set base type and width (int
is 4,float
is 8). - For
C
: An inherited attribute passes the base type down, and synthesized attributes build up array types.C.type = array(num.value, C1.type)
andC.width = num.value * C1.width
. - For
T
: Combines the base type and array information. - For
D
: Adds the final type and computed offset to the symbol table for each identifier. - For
P
: A marker nonterminalM
can be used to initialize a globaloffset
variable, allowing the SDT to be written in postfix form. P → M D where M → ε {offset=0;}.
4. Type Checking, Conversion, and Inference SDT's
Type checking
Type synthesis and type inference
- To do type checking, a compiler needs to assign a type expression to each component of the source program.
- Type checking can take on two forms: synthesis and inference.
- Type synthesis builds up the type of an expression from the types of its subexpressions.
- Type inference determines the type of a language construct from the way it is used. It is needed for languages like ML, which check types but do not require names to be declared.
Type checking rules
- Type checking is the validation of a set of type rules. Examples:
- The type of a variable must match its declaration.
- Operands of
+
,*
,-
,/
must be integers; the result is an integer. - Operands of
==
,!=
must be integer or string; the result is boolean. - In an assignment, the variable and expression types must match.
- In a function call, actual argument types must match formal parameter types.
- The return value type must match the function's declared return type.
Semantic rules for type checking (Expressions):
E → literal
⇒E.type := 'char'
E → num
⇒E.type := 'int'
E → id
⇒E.type := lookup(id.entry)
E → E1 mod E2
⇒E.type := if E1.type == 'int' and E2.type == 'int' then 'int' else type_error
E → E1[E2]
⇒E.type := if E2.type == 'int' and E1.type == 'array(s,t)' then t else type_error
Semantic rules for type checking (Statements):
S → id := E
⇒S.type := if id.type == E.type then void else type_error
S → if E then S1
⇒S.type := if E.type == 'boolean' and S1.type == void then void else type_error
S → while E do S1
⇒S.type := if E.type == 'boolean' and S1.type == void then void else type_error
Type checking implementation
- Type checking can be implemented by an AST visitor.
class TypeCheck implements Visitor {
Object visit(Add e, Object symbolTable) {
Type t1 = (Type) e.e1.accept(this, symbolTable);
Type t2 = (Type) e.e2.accept(this, symbolTable);
if (t1 == Int && t2 == Int) return Int;
else throw new TypeCheckError("+");
}
Object visit(Num e, Object symbolTable) {
return Int;
}
Object visit(Id e, Object symbolTable) {
return ((SymbolTable)symbolTable).lookupType(e);
}
}
Types comparison
- Option 1: Implement
T1.Equals(T2)
. For OO languages, also needT1.SubtypeOf(T2)
. - Option 2: Use unique objects for each distinct type. This allows for faster comparison using
==
.
Type conversion
- Widening conversions: intended to preserve information (e.g.,
int
tofloat
). - Narrowing conversions: can lose information (e.g.,
double
toint
). - Explicit conversion (cast): programmer must write something to cause the conversion.
- Implicit conversion (coercion): done automatically by the compiler. Usually limited to widening conversions.
Type conversion implementation
- For an expression like
E → E1 + E2
:- Find the maximum type:
E.type = max(E1.type, E2.type)
. - Widen both operands to the max type:
a1 = widen(E1.addr, E1.type, E.type)
. - Generate code for the operation:
gen(E.addr '=' a1 '+' a2)
.
- Find the maximum type:
- The
widen
function generates conversion instructions if necessary.
Type inference
- Unification is the problem of determining whether two expressions
s
andt
can be made identical by substituting expressions for the variables ins
andt
. - This process is used to infer types in languages that don't require explicit type declarations.
- Example: For
fun length(x) = if null(x) then 0 else length(tl(x)) + 1;
, the compiler unifies type variables to deduce thatx
must be alist(α)
and the function's type islist(α) → integer
.
5. Symbol Tables for Program Analysis and Transformation
- Symbol tables are not just for compilation; they are crucial for advanced program analysis and transformation tools, like those used for automated refactoring.
- Generic Symbol Table: A generic, reusable symbol table implementation for statically scoped languages is available at
github.com/antlr/symtab
. - Object-oriented models: Some tools (like JDeodorant) use detailed object-oriented models to represent types, properties, and relationships for sophisticated analysis.
- SciTools Understand:
- Provides a code-search database based on Entities (anything in code like a file, class, variable) and References (a specific place where an entity appears, defining a relationship).
- A Kind Filter is a string used to filter lists (e.g.,
db->ents("public method")
). - The Python API allows querying this database to analyze code structure.
- Example Query: List the file and line where each function is defined.
import understand db = understand.open("test.udb") ents = db.ents("function ~unknown ~unresolved") for ent in sorted(ents, key=lambda ent: ent.longname()): ref = ent.ref("definein"); if ref is not None: print (ent.longname(), "(", ent.parameters(), ")"); print (" ", ref.file().relname(), "(", ref.line(), ")");
- Open Understand: An open-source implementation of the SciTools Understand Python API.
Summary
- Semantic checks ensure the correct usage of variables, objects, expressions, statements, functions, and labels in the program.
- Scope semantic checks ensure that identifiers are correctly used within the scope of their declaration.
- Type semantic checks ensure the type consistency of various constructs in the program.
- Symbol tables: a data structure for storing information about symbols in the program.
- Used in semantic analysis and subsequent compiler stages.
- Used in program analysis for code query.
References
- Aho, Alfred V., Monica S. Lam, Ravi Sethi and Jeffrey D. Ullman (2006). Compilers: Principles, Techniques, and Tools (2nd Edition).
- Hagberg, Aric A., Daniel A. Schult and Pieter J. Swart (2008). “Exploring Network Structure, Dynamics, and Function using NetworkX”.
- Leicht, E. A. and M. E. J. Newman (mar. de 2008). “Community Structure in Directed Networks”.
- Mitchell, Brian S. and Spiros Mancoridis (mar. de 2006). “On the Automatic Modularization of Software Systems Using the Bunch Tool”.
- Parr, T. (2009). Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages.
- SciTools (2020). Understand.
- Tsantalis, Nikolaos and Alexander Chatzigeorgiou (2009). “Identification of Move Method Refactoring Opportunities”.
- Zafeiris, Vassilis E., Sotiris H. Poulias, N.A. Diamantidis and E.A. Giakoumakis (2017). “Automated refactoring of super-class method invocations to the Template Method design pattern”.