Morteza Zakeri

Services of the Intelligent Software Engineering Laboratory

2025-03-20T20:30:00+03:30

Amirkabir University of Technology (Tehran Polytechnic)
Author: Morteza Zakeri
Version: 1.0 (March 2025)

Introduction

This document presents a comprehensive list of services offered by the Intelligent Software Engineering Laboratory at Amirkabir University of Technology (Tehran Polytechnic). The services span multiple facets of software engineering—including testing, quality assurance, secure coding, requirements engineering, and development methodologies. The items listed herein are categorized by group, level, and service code to facilitate both clarity and ease of ordering.

Services

Group 1: Software Product Testing and Quality Assurance (Group Code: 1)

Level 1: Basic Practical Training Courses (Order Code: 11)

Service 1.1.1: Functional Testing of Software
Details: Testing levels, testing culture, tester roles, unit testing, integration testing, and system testing.
Order Code: 111
Service 1.1.2: Security Testing
Details: Stress testing, fault injection, fuzz testing, penetration testing, and penetration test scenarios.
Order Code: 112
Service 1.1.3: Performance Testing
Details: Basic testing, smoke tests, load testing, stress testing, capacity testing, load increase, stability assessment, breaking point determination, rapid change and stress testing, performance evaluation metrics, and performance test scenarios.
Order Code: 113

Level 2: Advanced Practical Training Courses (Order Code: 12)

Service 1.2.1: Advanced Software Testing
Details: Mutation testing, regression testing, model-based testing, model-driven testing, continuous testing, testing and evaluation of machine learning models (Software 2.0), test suite augmentation and optimization, and super testing.
Order Code: 121
Service 1.2.2: Automated Software Testing
Details: Automated test generation, injection and monitoring of test data, random-adaptive testing, search-based testing, symbolic (execution) testing, and concrete-symbolic (execution) testing.
Order Code: 122
Service 1.2.3: Software Debugging
Details: Dynamic analysis, spot analysis, fault localization, fault prediction, and program repair.
Order Code: 123
Service 1.2.4: Deployment of Tools for Testing, Debugging, and Dynamic Analysis
Details: Identification and deployment of tools, frameworks, platforms, and studios for testing, debugging, and dynamic analysis.
Order Code: 124

Level 3: Customized Training in Software Testing and Quality Assurance (Order Code: 13)

Service 1.3: Customized training course—tailored selection from the above topics.

Level 4: Engineering, Consulting, and Product Development Services (Order Code: 14)

Service 1.4.1: Functional Testing for Legacy Codebases
Details: Applicable to web, mobile, and desktop applications.
Order Code: 141
Service 1.4.2: Functional Testing for Codebases Under Development
Details: Applicable to web, mobile, and desktop applications.
Order Code: 142
Service 1.4.3: Installation and Deployment of Automated Testing Tools
Details: Tools installed in accordance with the product development stack (including LLM-based tools and compilers).
Order Code: 143
Service 1.4.4: Evaluation of Testing Efficiency and Estimation of Test Code Technical Debt
Details: Measurement of testability for web, mobile, desktop, embedded systems, and Software 2.0 projects and estimation of associated technical debt.
Order Code: 144
Service 1.4.5: Optimization and Enhancement of Test Suites, and Vulnerability Remediation
Details: Optimization and strengthening of the test suite for software projects (web, mobile, desktop, embedded systems, Software 2.0) with a reduction of test code technical debt and remediation of vulnerabilities.
Order Code: 145
Service 1.4.6: Security Testing Service
Details: Applicable to web, mobile, desktop, embedded systems, and Software 2.0.
Order Code: 146
Service 1.4.7: Performance Testing Service
Details: Applicable to web, mobile, desktop, embedded systems, and Software 2.0.
Order Code: 147
Service 1.4.8: Consulting on Quality Assurance and Testing Standards
Details: Guidance on recognizing and implementing software product quality assurance and testing standards.
Order Code: 148

Level 5: Issuance of Software Product Quality Certificate (External Quality) (Order Code: 15)

Service 1.5: Quality Certification Service.

Level 6: Customized Engineering, Consulting, and Product Development in Testing and Quality Assurance (Order Code: 16)

Service 1.6: Customized services as selected from the above topics.

Group 2: Software Development Quality Assurance Group (Group Code: 2)

Level 1: Basic Practical Training Courses (Order Code: 21)

Service 2.1.1: Clean Code
Details: Naming conventions, SOLID principles, functions, classes, comments, and code formatting.
Order Code: 211
Service 2.1.2: Clean Architecture
Details: Architecture types, architectural styles, architectural descriptions, the 4+1 view, clean architecture principles, and distance from the main sequence.
Order Code: 212
Service 2.1.3: Clean Coder
Details: Decision-making (knowing when to say yes or no), teamwork, time management, estimation, pressure management, and version control for both product and test code.
Order Code: 213
Service 2.1.4: Secure Coding
Details: Data validation, authentication and authorization, encryption, session management, exception handling, meeting security requirements, and code obfuscation.
Order Code: 214

Level 2: Advanced Practical Training Courses (Order Code: 22)

Service 2.2.1: Principles and Patterns in Software Engineering
Details: Topics include SOLID, PHAME, analysis patterns, design patterns, architecture patterns, anti-patterns, and refactoring patterns.
Order Code: 221
Service 2.2.2: Techniques for Program Transformation and Automated Software Refactoring
Details: Program transformation, repackaging, automated refactoring, and automatic measurement of software quality attributes.
Order Code: 222
Service 2.2.3: Software Reengineering and Migration Techniques
Details: Software clustering, extraction and migration of software architecture, and migration to System 2.0.
Order Code: 223
Service 2.2.4: Deployment of Tools for Static Analysis and Quality Assurance
Details: Identification and deployment of tools, frameworks, platforms, and studios for static analysis, refactoring, and ensuring the quality of software development.
Order Code: 224

Level 3: Customized Training in Software Development Quality Assurance (Order Code: 23)

Service 2.3: Customized training course—selection from the above topics.

Level 4: Engineering, Consulting, and Product Development Services (Order Code: 24)

Service 2.4.1: Evaluation of Internal Quality Attributes
Details: Assessing software maintainability and evolvability through metrics such as testability, understandability, reusability, readability, modifiability, flexibility, rigidity, and analyzability.
Order Code: 241
Service 2.4.2: Improvement of Internal Quality Attributes
Details: Implementing measures to enhance maintainability and evolvability considering the aforementioned quality metrics.
Order Code: 242
Service 2.4.3: Evaluation of External Quality Attributes – Dependability
Details: Assessment of reusability, efficiency and scalability, security, safety, reliability, and accessibility.
Order Code: 243
Service 2.4.4: Improvement of External Quality Attributes – Dependability
Details: Enhancing reusability, efficiency and scalability, security, safety, reliability, and accessibility.
Order Code: 244
Service 2.4.5: Implementation of Design by Contract and Source Code Clean-Up
Details: Aligning development practices with Clean Code principles and reducing technical debt in the code.
Order Code: 245
Service 2.4.6: Refactoring of Software Design and Source Code
Details: Applying Clean Architecture principles to minimize design-related technical debt.
Order Code: 246
Service 2.4.7: Software Reengineering and Architectural Migration
Details: Migrating software architecture to scalable systems while reducing architectural technical debt.
Order Code: 247
Service 2.4.8: Consulting and Mentoring on Software Quality Standards
Details: Guidance on the implementation of standards such as ISO/IEC 25010.
Order Code: 248

Level 5: Issuance of Software Development Quality Certificate (Internal Quality) (Order Code: 25)

Service 2.5: Quality Certification Service.

Level 6: Customized Engineering, Consulting, and Product Development in Software Development Quality (Order Code: 26)

Service 2.6: Customized services—selection from the above topics.

Group 3: Software Requirements Engineering and Development Methodologies Group (Group Code: 3)

Level 1: Basic Practical Training Courses (Order Code: 31)

Service 3.1.1: Agile Requirements Engineering
Details: Extraction, analysis, and specification of requirements; use of requirement templates in agile software development; evaluation and improvement of requirement quality.
Order Code: 311
Service 3.1.2: Introduction to Agile Software Development Approaches
Details: Overview of practices such as TDD, BDD, DevOps, MLOps, and CI/CD.
Order Code: 312
Service 3.1.3: Management of Software and IT Projects
Details: Size and cost estimation, project status monitoring, and management of various types of technical debt.
Order Code: 313

Level 2: Advanced Practical Training Courses (Order Code: 32)

Service 3.2.1: Automated Requirements Engineering
Details: Detection and repair of requirement smells, generation of acceptance tests, creation of conceptual models from requirements, code generation from requirements, generating explanations from code, and repair of traceability links.
Order Code: 321
Service 3.2.2: Systematic Research and Presentation in Software Engineering
Details: Techniques such as SMS, SLR, MVLR, alongside computational thinking and reverse computation methods.
Order Code: 322
Service 3.2.3: Compiler Engineering, Large Language Models, and Domain-Specific Languages
Details: Design and development of DSLs for specialized querying, guided engineering, and data description.
Order Code: 323
Service 3.2.4: Deployment of Tools for Requirements Engineering and Documentation
Details: Identification and deployment of frameworks, platforms, and studios for managing requirements and project documents.
Order Code: 324

Level 3: Customized Training Course (Order Code: 33)

Service 3.3: Customized training course—selection from the above topics.

Level 4: Engineering, Consulting, and Product Development Services (Order Code: 34)

Service 3.4.1: Extraction and Modeling of Software Requirements Documentation
Details: Analysis and modeling of requirements for new software systems.
Order Code: 341
Service 3.4.2: Business Process Modeling (BPMN) and Creation of Services Based on System 2.0
Details: Utilizing large language models to innovate service creation.
Order Code: 342
Service 3.4.3: Designing Checklists and Generating Acceptance Tests for Software Validation
Order Code: 343
Service 3.4.4: Creating UML Conceptual Models from Requirements Documentation
Order Code: 344
Service 3.4.5: Database Design and Normalization
Details: Designing and normalizing software databases in accordance with the software requirements documentation.
Order Code: 345
Service 3.4.6: Creation and Deployment of Domain-Specific Languages (DSLs)
Details: DSLs for development, guided engineering, and querying/reporting.
Order Code: 346
Service 3.4.7: Extraction and Repair of Legacy Code Documentation
Details: Addressing requirements, design, and implementation documentation for legacy codebases.
Order Code: 347
Service 3.4.8: Creation and Repair of Traceability Links for Legacy Codebases
Order Code: 348

Level 5: Issuance of Software Requirements and Documentation Quality Certificate (Validation) (Order Code: 35)

Service 3.5: Quality Certification Service.

Level 6: Customized Engineering, Consulting, and Product Development in Requirements Engineering and Software Methodologies (Order Code: 36)

Service 3.6: Customized services—selection from the above topics.

Conclusion

The Intelligent Software Engineering Laboratory at Amirkabir University of Technology (Tehran Polytechnic) provides a diverse range of services aimed at improving the efficiency, quality, and reliability of software systems. Our offerings encompass comprehensive training courses, targeted consulting services, and custom engineering solutions that address both current challenges and future requirements in software engineering. Whether your focus is on testing, quality assurance, secure development, or requirements engineering, our expertise can help guide your organization toward excellence.

Future Work

Looking ahead, our laboratory is committed to expanding our service portfolio by:

Integrating advanced AI and machine learning techniques into testing and quality assurance processes.
Developing cutting-edge tools and frameworks for automated refactoring and requirements engineering.
Broadening our training programs to include emerging methods and best practices in modern software development.
Enhancing our consulting services to better address the complexities of legacy systems and large-scale software infrastructures.

We eagerly anticipate collaborating with industry partners and academic institutions to drive innovation and excellence in software engineering.

Advanced Software Testing

2025-03-17T20:45:00+03:30

Graduate Course

Course Description

This course delves into advanced topics and cutting-edge techniques in software testing, empowering students to design and implement rigorous testing strategies for modern software systems. By covering methods such as automatic test data generation, metamorphic testing, fuzzing, hyper-property testing, and program analysis, the course equips students with tools to detect, isolate, and resolve complex software issues. Emphasis is placed on automation, scalability, and the theoretical foundations of testing to address real-world challenges in software quality assurance.

Course Objectives

By the end of this course, students will:

Understand the theoretical underpinnings of advanced software testing methodologies.
Develop practical skills in automated testing, including test data generation and fuzzing.
Apply advanced techniques such as metamorphic testing and abstract interpretation to tackle testing challenges.
Analyze and verify software properties using program analysis and hyper-property testing.
Explore state-of-the-art tools and frameworks to enhance software reliability and security.

Syllabus

Week 1-2: Fundamentals of Advanced Software Testing

Recap of basic software testing concepts
Challenges in modern software testing
Introduction to test automation and advanced testing methodologies

Week 3-4: Automatic Test Data Generation

Theory and algorithms for test data generation
Symbolic execution and constraint solving
Tools for automated test generation

Week 5-6: Metamorphic Testing

Addressing the oracle problem in software testing
Designing and applying metamorphic relations
Applications of metamorphic testing in various domains

Week 7-8: Fuzz Testing

Introduction to fuzzing techniques: black-box, gray-box, and white-box fuzzing
Coverage-guided fuzzing and mutation-based testing
Case studies of fuzzing tools (e.g., AFL, libFuzzer)

Week 9-10: Hyper-Property Testing

Understanding hyper-properties and their significance
Techniques for verifying and validating hyper-properties
Applications in security, privacy, and concurrency testing

Week 11-12: Program Analysis and Abstract Interpretation

Static and dynamic program analysis techniques
Abstract interpretation and its role in bug detection
Tools for program analysis (e.g., Clang Static Analyzer, CodeQL)

Week 13: Testing in the Real World

Integration of advanced testing techniques in agile and DevOps environments
Testing frameworks and automation pipelines
Challenges and opportunities in adopting advanced testing practices

Week 14: Capstone Project and Review

Designing and implementing a comprehensive testing strategy for a real-world software system
Final presentations and peer reviews

Course Assessment

Assignments (25%): Hands-on exercises in test automation, fuzzing, and program analysis.
Paper-based Exam (40%): Theoretical evaluation of advanced testing methodologies.
Capstone Project (25%): Group project involving the application of advanced testing techniques.
Participation (10%): Contributions to discussions, code reviews, and peer learning.

Resources

Textbooks:
Fuzzing: Brute Force Vulnerability Discovery by Michael Sutton, Adam Greene, and Pedram Amini
Introduction to Software Testing by Paul Ammann and Jeff Offutt
Online Platforms:
Tools and frameworks such as AFL, libFuzzer, Z3 (solver), and Clang Static Analyzer
Open-source repositories for testing datasets
Research Papers:
Recent studies on advanced testing techniques from top-tier conferences (ICSE, FSE, ASE, ICPC, and ISSTA)

Prerequisites

Familiarity with basic software testing principles and techniques.
Knowledge of programming languages (e.g., Java, Python, C++) and data structures.

Contact Information

For inquiries, feel free to reach out via my webpage: www.m-zakeri.github.io.

Software Architectures

2025-03-07T21:00:00+03:30

Course Description

This course offers an in-depth study of software architectures, emphasizing their role in creating scalable, maintainable, and high-performing systems. Students will explore various architectural styles, principles of clean architecture, and cutting-edge AI-driven architectural frameworks. By combining theoretical concepts with practical case studies, the course equips students with the skills to design robust architectures tailored to diverse application domains.

Course Objectives

By the end of this course, students will:

Understand the principles and best practices of modern software architecture.
Explore various architectural patterns, including monolithic, layered, microservices, and event-driven architectures.
Learn and apply clean architecture principles for maintainable and testable systems.
Design AI-driven architectures tailored to intelligent systems and applications.
Evaluate and adapt architectural decisions to meet real-world challenges in scalability, security, and performance.

Syllabus

Week 1-2: Introduction to Software Architecture

The role of architecture in software development
Core concepts and trade-offs in architectural decisions
Overview of architecture evaluation frameworks

Week 3-4: Classical Architectural Styles

Monolithic architecture: Advantages and limitations
Layered architecture: Principles and practical applications
Event-driven architecture: Asynchronous communication and scalability

Week 5-6: Component-Based and Service-Oriented Architectures

Component-based design principles
Introduction to microservices architecture
Best practices for service-oriented systems and RESTful APIs

Week 7-8: Principles of Clean Architecture

Introduction to clean architecture principles
Designing for testability and maintainability
Practical applications and refactoring for clean architecture

Week 9-10: AI-Driven Architectures

Architectural patterns for machine learning and AI systems
Designing scalable AI pipelines and inference systems
Challenges and solutions for deploying AI in production

Week 11-12: Advanced Topics in Software Architecture

Architecture for distributed systems and cloud-native applications
Security considerations in architectural design
Case studies: E-commerce, IoT, and healthcare systems

Week 13: Emerging Trends in Architecture

Exploring serverless architectures and function-as-a-service (FaaS)
Domain-driven design (DDD) and its applications
Future directions in architectural practices

Week 14: Capstone Project and Review

Designing a software architecture for a real-world problem
Final presentations and feedback from peers and instructors

Course Assessment

Assignments (25%): Hands-on tasks to design and evaluate architectural styles.
Paper-based Exam (40%): Theoretical evaluation of architectural principles and practices.
Capstone Project (25%): Collaborative design and implementation of a software architecture.
Participation (10%): Engagement in discussions, case studies, and peer reviews.

Resources

Textbooks:
Software Architecture in Practice by Len Bass, Paul Clements, and Rick Kazman
Clean Architecture: A Craftsman's Guide to Software Structure and Design by Robert C. Martin
Online Platforms:
Tools for architectural design and modeling, such as ArchiMate and UML tools
Cloud platforms (AWS, Azure, GCP) for hands-on exercises
Research Papers:
Recent studies on software architectures from conferences such as WICSA and ECSA

Prerequisites

Basic understanding of software engineering and system design principles.
Familiarity with programming and fundamental development methodologies.

Contact Information

For inquiries, feel free to reach out via my webpage: www.m-zakeri.github.io.

A gentle introduction to search-based software refactoring

2022-05-05T00:45:00+04:30

An Introduction to Search-Based Refactoring

Identifying the optimal sequence of refactoring operations to apply to a software system is a challenging optimization problem. This problem falls within the domain of Search-Based Software Engineering (SBSE), which leverages search techniques to address software engineering challenges. In the context of search-based refactoring, refactorings are applied stochastically to an initial software solution, which is then evaluated using a fitness function comprising one or more software quality metrics.

Despite the growing interest in SBSE, there is a notable lack of comprehensive technical documentation that details the implementation of robust search-based refactoring methodologies. This article aims to bridge this gap by discussing the principles of search-based refactoring and outlining its practical implementation at the source code level.

Search-Based Refactoring: A Comprehensive Guide

What is Search-Based Refactoring?

Refactoring is the process of improving the internal structure of software without altering its external behavior. The goal is to make the code more maintainable, scalable, and efficient. However, finding the most effective sequence of refactoring steps is inherently complex due to the vast search space of possible transformations.

Search-based refactoring approaches this complexity as an optimization problem. The core idea is to model the software system, apply refactoring operations iteratively or stochastically, and evaluate the result using a fitness function. This function serves as the guiding metric, assessing various qualities of the software, such as modularity, code readability, or coupling and cohesion.

Why Search-Based Refactoring?

Search-based methods hold several advantages over manual or rule-based refactoring:

Automation: It reduces the effort and time required to identify and apply effective refactorings.
Scalability: Search-based techniques can handle large codebases where manual refactoring is impractical.
Optimization: By leveraging heuristic or metaheuristic search techniques (e.g., genetic algorithms, simulated annealing), it can identify refactoring sequences that improve the chosen quality metrics more than manually selected sequences do.

Core Components of Search-Based Refactoring

Refactoring Operations: These are the transformations applied to the code, such as renaming a class, extracting a method, or replacing magic numbers with constants. Each operation must preserve the software's external behavior.
Fitness Function: The fitness function evaluates the quality of the refactored code. Common software quality metrics include cyclomatic complexity, code duplication, maintainability index, and coupling/cohesion ratios.
Search Algorithm: Metaheuristic algorithms such as genetic algorithms (GAs), particle swarm optimization (PSO), or hill-climbing methods are commonly used. These techniques explore the search space of possible refactoring sequences to maximize the fitness function.
Stopping Criteria: A predefined stopping condition, such as the number of iterations or a target fitness value, determines when the search process concludes.
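
The four components above can be sketched on a toy example. The following is a minimal illustration, not the implementation of any particular tool: the system model (a mapping from methods to classes), the CALLS dependency list, the class-size limit, and all function names are assumptions made for this sketch. It uses a single refactoring operation (move method), a coupling-based fitness function, stochastic hill climbing as the search algorithm, and two stopping criteria (an iteration budget and a target fitness).

```python
import random

# Hypothetical toy model: each method is assigned to a class, and CALLS lists
# method-to-method dependencies. None of these names come from a real tool.
CALLS = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e"), ("e", "f"), ("f", "d")]
INITIAL = {"a": "X", "b": "Y", "c": "X", "d": "Y", "e": "X", "f": "Y"}
MAX_CLASS_SIZE = 4  # assumed constraint so that one giant class is not "optimal"


def fitness(assignment):
    """Higher is better: the negated number of calls that cross class boundaries."""
    return -sum(1 for src, dst in CALLS if assignment[src] != assignment[dst])


def is_valid(assignment):
    """Reject solutions in which any class exceeds the assumed size limit."""
    sizes = {}
    for cls in assignment.values():
        sizes[cls] = sizes.get(cls, 0) + 1
    return all(size <= MAX_CLASS_SIZE for size in sizes.values())


def random_move_method(assignment, rng):
    """Refactoring operation: move one randomly chosen method to another class."""
    candidate = dict(assignment)
    method = rng.choice(sorted(candidate))
    targets = sorted(set(candidate.values()) - {candidate[method]})
    if targets:
        candidate[method] = rng.choice(targets)
    return candidate


def hill_climb(initial, max_iterations=1000, target_fitness=0, seed=0):
    """Search algorithm: accept a random move only if it improves the fitness."""
    rng = random.Random(seed)
    current, current_fit = dict(initial), fitness(initial)
    for _ in range(max_iterations):           # stopping criterion 1: budget
        if current_fit >= target_fitness:     # stopping criterion 2: target reached
            break
        candidate = random_move_method(current, rng)
        if is_valid(candidate) and fitness(candidate) > current_fit:
            current, current_fit = candidate, fitness(candidate)
    return current, current_fit


if __name__ == "__main__":
    best, best_fit = hill_climb(INITIAL)
    print(best, best_fit)
```

Hill climbing is chosen here only because it is the shortest algorithm to write down; it can get stuck in local optima, which is one reason population-based methods such as genetic algorithms are often preferred on realistic search spaces.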

My Refactoring Services for Large and Legacy Codebases

I specialize in providing professional refactoring services for companies dealing with large and legacy codebases. Such systems often present unique challenges, including outdated architecture, high coupling, and lack of documentation. Leveraging search-based refactoring and my expertise in software engineering, I offer the following:

Code Analysis and Assessment: Conducting an in-depth evaluation of the codebase to identify critical areas for improvement.
Custom Refactoring Solutions: Applying search-based refactoring techniques tailored to the specific needs and objectives of the client.
Software Quality Enhancement: Improving maintainability, scalability, and performance while preserving the functional integrity of the software.
Consulting and Training: Providing guidance and training to internal teams on adopting modern refactoring practices and integrating computational thinking into their workflows.

By addressing the challenges inherent in large and legacy systems, I empower organizations to modernize their software infrastructure and enhance its long-term value.

Future Work in Search-Based Refactoring

Search-based refactoring is a dynamic and evolving field, with several opportunities for future research and development:

Enhanced Fitness Functions: Developing multi-objective fitness functions that account for emerging software quality attributes such as energy efficiency and resilience.
Scalability Improvements: Designing algorithms that can efficiently handle ever-growing codebases in modern software systems.
Integration with CI/CD Pipelines: Automating search-based refactoring within continuous integration and deployment workflows to enable seamless updates.
AI-Driven Refactoring: Leveraging advanced AI models to predict optimal refactoring paths and adaptively learn from previous iterations.
Domain-Specific Refactoring: Exploring how search-based refactoring can be customized for specialized domains, such as embedded systems or high-performance computing.
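
To make the idea of a multi-objective fitness function more concrete, the sketch below compares candidate refactoring sequences by Pareto dominance instead of collapsing all objectives into one number. The candidate names and their scores (maintainability, energy efficiency, resilience) are invented for illustration, and all objectives are assumed to be maximized.

```python
def dominates(a, b):
    """True if objective vector `a` Pareto-dominates `b` (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(dominates(other, scores) for other in candidates.values())
    }


# Hypothetical scores: (maintainability, energy efficiency, resilience).
CANDIDATES = {
    "seq_1": (0.80, 0.40, 0.60),  # dominated by seq_4
    "seq_2": (0.70, 0.70, 0.60),
    "seq_3": (0.60, 0.30, 0.50),  # dominated by every other candidate
    "seq_4": (0.80, 0.40, 0.90),
}

if __name__ == "__main__":
    print(sorted(pareto_front(CANDIDATES)))
```

The result is a set of trade-off solutions rather than a single winner, so the final choice among the non-dominated sequences is left to the developer or to a secondary criterion.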

Conclusion

Search-based refactoring represents a powerful methodology for addressing the complexity of modern software systems. By framing refactoring as an optimization problem and leveraging advanced search techniques, this approach can significantly enhance software quality and maintainability.

My professional services aim to bring this transformative technology to companies, particularly those burdened by large and legacy codebases. The future of search-based refactoring lies in its continued evolution to meet the demands of emerging technologies and industries. By fostering innovation and collaboration, we can unlock the full potential of computational thinking in software engineering.

CodART: Automated Source Code Refactoring Toolkit

2022-05-02T23:58:00+04:30

Abstract: Software refactoring changes the structure of software without modifying its external behavior. Many software quality attributes, such as reusability, flexibility, understandability, and testability, can be enhanced through source code refactoring. Refactoring engines are tools that automate the application of refactorings: first, the user chooses a refactoring to apply, then the engine checks whether the transformation is safe, and if so, transforms the program. Refactoring engines are a key component of modern Integrated Development Environments (IDEs), and programmers rely on them to perform refactorings. In this project, an open-source software toolkit for refactoring Java source code, namely CodART, will be developed. The ANTLR parser generator is used to create and modify the program syntax tree and produce the refactored version of the program. To the best of our knowledge, CodART is the first open-source refactoring toolkit based on ANTLR.

Index Terms: Software refactoring, refactoring engine, search-based refactoring, ANTLR, Java.

1 Introduction

Refactoring is a behavior-preserving program transformation that improves the design of a program. Refactoring engines are tools that automate the application of refactorings. The programmer need only select which refactoring to apply, and the engine will automatically check the preconditions and apply the transformations across the entire program if the preconditions are satisfied. Refactoring is gaining popularity, as evidenced by the inclusion of refactoring engines in modern IDEs such as IntelliJ IDEA, Eclipse, or NetBeans for Java.

Consider the EncapsulateField refactoring as an illustrative example. This refactoring replaces all references to a field with accesses through setter and getter methods. The EncapsulateField refactoring takes as input the name of the field to encapsulate and the names of the new getter and setter methods. It performs the following transformations:

Creates a public getter method that returns the field's value,
Creates a public setter method that sets the field to the value of a given parameter,
Replaces all field reads with calls to the getter method,
Replaces all field writes with calls to the setter method,
Changes the field's access modifier to private.

The EncapsulateField refactoring checks several preconditions, including that the code does not already contain accessor methods and that these methods are applicable to the expressions in which the field appears. Figure 1 shows a sample program before and after encapsulating the field f into the getF and setF methods.

Figure 1. Example EncapsulateField refactoring

Refactoring engines must be reliable. A fault in a refactoring engine can silently introduce bugs in the refactored program and lead to challenging debugging sessions. If the original program compiles, but the refactored program does not, the refactoring is obviously incorrect and can be easily undone. However, if the refactoring engine erroneously produces a refactored program that compiles but does not preserve the semantics of the original program, this can have severe consequences.

To perform refactoring correctly, the tool has to operate on the syntax tree of the code, not on its text. Manipulating the syntax tree preserves the program's behavior far more reliably than editing text. However, refactoring involves more than understanding and updating the syntax tree: the tool also needs to render the modified tree back into text in the editor view, a step called code transformation. All in all, implementing a decent refactoring is a challenging programming exercise that requires compiler knowledge.
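
As a rough illustration of tree-based transformation, the sketch below rewrites the use sites of a field into getter and setter calls. It is an analogy only: it uses Python's standard ast module on Python code (Python 3.9+ for ast.unparse), whereas CodART targets Java through ANTLR. The class name EncapsulateFieldUses and the helper encapsulate_field are hypothetical. The sketch covers only plain reads and single-target assignments in client code; it does not create the accessor methods, change the field's visibility, handle augmented assignments such as p.f += 1, or check any preconditions.

```python
import ast


class EncapsulateFieldUses(ast.NodeTransformer):
    """Rewrite reads and writes of `<obj>.<field>` into getter/setter calls."""

    def __init__(self, field, getter, setter):
        self.field, self.getter, self.setter = field, getter, setter

    def visit_Assign(self, node):
        # Transform reads on the right-hand side first.
        self.generic_visit(node)
        target = node.targets[0]
        if (len(node.targets) == 1
                and isinstance(target, ast.Attribute)
                and target.attr == self.field):
            # `obj.field = value`  ->  `obj.setter(value)`
            call = ast.Call(
                func=ast.Attribute(value=target.value, attr=self.setter,
                                   ctx=ast.Load()),
                args=[node.value], keywords=[])
            return ast.copy_location(ast.Expr(value=call), node)
        return node

    def visit_Attribute(self, node):
        self.generic_visit(node)
        if node.attr == self.field and isinstance(node.ctx, ast.Load):
            # `obj.field` used as a value  ->  `obj.getter()`
            call = ast.Call(
                func=ast.Attribute(value=node.value, attr=self.getter,
                                   ctx=ast.Load()),
                args=[], keywords=[])
            return ast.copy_location(call, node)
        return node


def encapsulate_field(source, field, getter, setter):
    """Parse the source, transform the tree, and render it back to text."""
    tree = EncapsulateFieldUses(field, getter, setter).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


if __name__ == "__main__":
    print(encapsulate_field("p.f = p.f + 1\nprint(p.f)", "f", "get_f", "set_f"))
```

Note that ast.unparse regenerates the text from the tree and therefore discards comments and the original formatting. A production refactoring engine has to preserve them, which is part of what makes the rendering step difficult.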

In this project, we develop CodART, a toolkit that applies a given refactoring to source code and produces the refactored code. To this end, we use ANTLR [1] to generate and modify the program syntax tree. CodART development consists of two phases. In the first phase, 47 common refactoring operations will be automated. In the second phase, an algorithm that finds the best sequence of refactorings to apply to a given software system will be developed using many-objective search-based approaches.

The rest of this white paper is organized as follows. Section 2 describes the refactoring operations in detail. Section 3 explains code smells in detail. Section 4 briefly discusses search-based refactoring techniques and many-objective evolutionary algorithms. Section 5 explains the implementation details of the current version of CodART. Section 6 lists the Java projects used to evaluate CodART. Section 7 describes the proposals behind the CodART project. Finally, the conclusion and future work are discussed in Section 8.

2 Refactoring operations

This section explains the refactoring operations used in the project. Fowler [2] has proposed a catalog of 72 refactoring operations. We call these refactorings atomic refactoring operations.

Each refactoring operation has a definition and is clearly specified by the entities it involves and the role of each. Table 1 describes the refactorings that we aim to automate. It is worth noting that not all of these refactoring operations were introduced by Fowler [2]. A concrete example for most of the refactoring operations in the table is available at https://refactoring.com/catalog/. Examples of other refactorings can be found at https://refactoring.guru/refactoring/techniques and https://sourcemaking.com/refactoring/refactorings.

Table 1. Refactoring operations

Refactoring	Definition	Entities	Roles
Move class	Move a class from one package to another	package, class	source package, target package, moved class
Move method	Move a method from one class to another	class, method	source class, target class, moved method
Merge packages	Merge the elements of a set of packages into one of them	package	source package, target package
Extract/Split package	Add a package to compose the elements of another package	package	source package, target package
Extract class	Create a new class and move fields and methods from the old class to the new one	class, method	source class, new class, moved methods
Extract method	Extract a code fragment into a method	method, statement	source method, new method, moved statements
Inline class	Move all features of a class into another one and remove it	class	source class, target class
Move field	Move a field from one class to another	class, field	source class, target class, field
Push down field	Move a field of a superclass to a subclass	class, field	superclass, subclasses, moved field
Push down method	Move a method of a superclass to a subclass	class, method	superclass, subclasses, moved method
Pull up field	Move a field from subclasses to the superclass	class, field	subclasses, superclass, moved field
Pull up method	Move a method from subclasses to the superclass	class, method	subclasses, superclass, moved method
Increase field visibility	Increase the visibility of a field from public to protected, protected to package or package to private	class, field	source class, source field
Decrease field visibility	Decrease the visibility of a field from private to package, package to protected or protected to public	class, field	source class, source field
Make field final	Make a non-final field final	class, field	source class, source field
Make field non-final	Make a final field non-final	class, field	source class, source field
Make field static	Make a non-static field static	class, field	source class, source field
Make field non-static	Make a static field non-static	class, field	source class, source field
Remove field	Remove a field from a class	class, field	source class, source field
Increase method visibility	Increase the visibility of a method from public to protected, protected to package or package to private	class, method	source class, source method
Decrease method visibility	Decrease the visibility of a method from private to package, package to protected or protected to public	class, method	source class, source method
Make method final	Make a non-final method final	class, method	source class, source method
Make method non-final	Make a final method non-final	class, method	source class, source method
Make method static	Make a non-static method static	class, method	source class, source method
Make method non-static	Make a static method non-static	class, method	source class, source method
Remove method	Remove a method from a class	class, method	source class, source method
Make class final	Make a non-final class final	class	source class
Make class non-final	Make a final class non-final	class	source class
Make class abstract	Change a concrete class to abstract	class	source class
Make class concrete	Change an abstract class to concrete	class	source class
Extract subclass	Create a subclass for a set of features	class, method	source class, new subclass, moved methods
Extract interface	Extract methods of a class into an interface	class, method	source class, new interface, interface methods
Inline method	Move the body of a method into its callers and remove the method	method	source method, caller methods
Collapse hierarchy	Merge a superclass and a subclass	class	superclass, subclass
Remove control flag	Replace control flag with a break	class, method	source class, source method
Replace nested conditional with guard clauses	Replace nested conditional with guard clauses	class, method	source class, source method
Replace constructor with a factory function	Replace constructor with a factory function	class	source class
Replace exception with test	Replace exception with precheck	class, method	source class, source method
Rename field	Rename a field	class, field	source class, source field
Rename method	Rename a method	class, method	source class, source method
Rename class	Rename a class	class	source class
Rename package	Rename a package	package	source package
Encapsulate field	Create setter/mutator and getter/accessor methods for a private field	class, field	source class, source field
Replace parameter with query	Replace parameter with query	class, method	source class, source method
Pull up constructor body	Move the constructor	class, method	subclass class, superclass constructor
Replace control flag with break	Replace control flag with break	class, method	source class, source method
Remove flag argument	Remove flag argument	class, method	source class, source method
Total	47	—	—

3 Code smells

Deciding when and where to start refactoring—and when and where to stop—is just as important to refactoring as knowing how to operate its mechanics [2]. To answer this important question, we should know the refactoring activities. The refactoring process consists of six distinct activities [9]:

Identify where the software should be refactored.
Determine which refactoring(s) should be applied to the identified places.
Guarantee that the applied refactoring preserves behavior.
Apply the refactoring.
Assess the effect of the refactoring on quality characteristics of the software (e.g., complexity, understandability, maintainability) or the process (e.g., productivity, cost, effort).
Maintain the consistency between the refactored program code and other software artifacts (such as documentation, design documents, requirements specifications, tests, etc.).

Table 2. Code smells

Code smell	Descriptions and other names
God class	The class defines many data members (fields) and methods and exhibits low cohesion. The god class smell occurs when a huge class surrounded by many data classes acts as a controller (i.e., takes most of the decisions and monopolizes the software's functionality). Other names: Blob, large class, brain class.
Long method	This smell occurs when a method is too long to understand and most likely performs more than one responsibility. Other names: God method, brain method, large method.
Feature envy	This smell occurs when a method seems more interested in a class other than the one it actually is in.
Data class	This smell occurs when a class contains only fields and possibly getters/setters without any behavior (methods).
Shotgun surgery	This smell characterizes the situation when one kind of change leads to many changes to multiple different classes. When the changes are all over the place, they are hard to find, and it is easy to miss a necessary change.
Refused bequest	This smell occurs when a subclass rejects some of the methods or properties offered by its superclass.
Functional decomposition	This smell occurs when experienced developers with a background in procedural languages write highly procedural, non-object-oriented code in an object-oriented language.
Long parameter list	This smell occurs when a method accepts a long list of parameters. Such lists are hard to understand and difficult to use.
Promiscuous package	A package can be considered promiscuous if it contains classes implementing too many features, making it too hard to understand and maintain. As for god class and long method, this smell arises when the package has low cohesion since it manages different responsibilities.
Misplaced class	A Misplaced Class smell suggests a class that is in a package that contains other classes not related to it.
Switch statement	This smell occurs when switch statements that switch on type codes are spread across the software system instead of exploiting polymorphism.
Spaghetti code	This smell refers to unmaintainable, incomprehensible code without any structure. Such code does not exploit, and even prevents the use of, object-orientation mechanisms and concepts.
Divergent change	Divergent change occurs when one class is commonly changed in different ways for different reasons. Other names: Multifaceted abstraction.
Deficient encapsulation	This smell occurs when the declared accessibility of one or more members of an abstraction is more permissive than actually required.
Swiss army knife	This smell arises when the designer attempts to provide all possible uses of the class and ends up in an excessively complex class interface.
Lazy class	Unnecessary abstraction
Cyclically-dependent modularization	This smell arises when two or more abstractions depend on each other directly or indirectly.
Primitive obsession	This smell occurs when primitive data types are used where an abstraction encapsulating the primitives could serve better.
Speculative generality	This smell occurs when an abstraction is created based on speculated requirements. Such an abstraction is often unnecessary and makes the code difficult to understand and maintain.
Message chains	A message chain occurs when a client requests another object, that object requests yet another one, and so on. These chains mean that the client depends on navigation along the class structure. Any change in these relationships requires modifying the client.
Total	20

4 Search-based refactoring

After the refactoring operations have been automated, we must decide which refactorings should be performed in order to improve software quality. The concern with the refactoring operations in Table 1 is whether each of them has a positive impact on the quality of the refactored code. Finding the right sequence of refactorings to apply to a software artifact is a challenging task since there is a wide range of refactorings. The ideal sequence must therefore correlate with the different quality attributes to be improved as a result of applying the refactorings.

Finding the best refactoring sequence is an optimization problem that can be solved by search techniques in the field known as Search-Based Software Engineering (SBSE) [3]. In this approach, refactorings are applied stochastically to the original software solution, and then the software is measured using a fitness function consisting of one or more software metrics. There are various metric suites available to measure characteristics like cohesion and coupling, but different metrics measure the software in different ways, and thus how they are applied will have a different effect on the outcome.
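The idea of applying refactorings stochastically and scoring the result with a metric-based fitness function can be sketched as a simple stochastic hill climber. This is only an illustration: the design is reduced to a dictionary of metric values, the two toy refactorings and their metric effects are invented for the example, and the weighted sum is one possible fitness function among many.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Metrics = Dict[str, float]
Refactoring = Tuple[str, Callable[[Metrics], Metrics]]


def fitness(metrics: Metrics, weights: Dict[str, float]) -> float:
    """Weighted sum of metrics; a negative weight penalises that metric."""
    return sum(weight * metrics[name] for name, weight in weights.items())


def stochastic_hill_climb(start: Metrics, refactorings: Sequence[Refactoring],
                          weights: Dict[str, float], steps: int,
                          seed: int = 0) -> Tuple[Metrics, List[str]]:
    """Try random refactorings and keep only those that raise the fitness."""
    rng = random.Random(seed)
    current, applied = dict(start), []
    for _ in range(steps):
        name, apply = rng.choice(refactorings)
        candidate = apply(current)
        if fitness(candidate, weights) > fitness(current, weights):
            current = candidate
            applied.append(name)
    return current, applied


# Hypothetical metric effects, chosen only to make the example observable.
def toy_move_method(m: Metrics) -> Metrics:
    return {**m, "coupling": m["coupling"] - 1.0, "cohesion": m["cohesion"] + 0.1}


def toy_inline_class(m: Metrics) -> Metrics:
    return {**m, "coupling": m["coupling"] + 1.0, "cohesion": m["cohesion"] - 0.1}


OPERATIONS = [("move_method", toy_move_method), ("inline_class", toy_inline_class)]
WEIGHTS = {"coupling": -1.0, "cohesion": 1.0}

best, applied = stochastic_hill_climb(
    {"coupling": 10.0, "cohesion": 0.2}, OPERATIONS, WEIGHTS, steps=20)
print(applied, round(fitness(best, WEIGHTS), 2))
```

Changing the weights changes which refactorings are accepted, which is the sensitivity to metric choice described above.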

The second phase of this project is to use a many-objective search algorithm to find the best sequence of refactorings for a given project. Recently, many-objective SBSE approaches for refactoring [3]–[5] and remodularization (regrouping a set of classes C in terms of packages P) [6] have gained more attention due to their ability to find sequences of refactoring operations that lead to improvements in software quality. Therefore, we first focus on implementing the approaches proposed in [3], [5], [6] as fundamental works in this area. Then, we will improve on these approaches. As a new contribution, we add new refactoring operations and new objective functions to improve the quality attributes of the software. We also evaluate our method on new software projects that were not used in previous works.
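With several objectives there is usually no single best solution, so many-objective algorithms compare candidates by Pareto dominance. The following minimal sketch shows that relation and a brute-force non-dominated filter. It assumes every objective is to be maximised and is not the selection procedure of any specific algorithm from [3], [5], or [6].

```python
from typing import List, Sequence

Objectives = Sequence[float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points: Sequence[Objectives]) -> List[Objectives]:
    """Keep the points that no other point dominates (quadratic-time filter)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Each tuple holds hypothetical scores of one refactoring sequence
# on three quality attributes.
candidates = [(3, 1, 2), (1, 3, 2), (2, 2, 2), (1, 1, 1), (3, 1, 1)]
print(pareto_front(candidates))
```

The surviving points are the trade-offs among which a developer, or a later selection step, has to choose.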

5 Implementation

This section describes the implementation details of CodART. It covers the CodART architecture, the high-level structure of the repository directories, refactoring automation with the ANTLR parser generator, and refactoring recommendation through many-objective search-based software engineering techniques.

5.1 CodART architecture

The high-level architecture of CodART is shown in Figure 2. The source code consists of several Python packages and directories. We briefly describe each component in CodART.

Figure 2. CodART architecture

I. grammars: This directory contains three ANTLR4 grammars for the Java programming language:

Java9_v2.g4: This grammar was used in the initial version of CodART. Its main problem is that parsing large source code files is very slow due to some decisions made in the grammar design. We have switched to the fast grammar JavaParserLabeled.g4.
JavaLexer.g4: The lexer of the fast Java grammar. This lexer is used for both fast parsers, i.e., JavaParser.g4 and JavaParserLabeled.g4.
JavaParser.g4: The original parser of the fast Java grammar. This parser is currently used in some refactorings. In a future release, this grammar will be replaced with JavaParserLabeled.g4.
JavaParserLabeled.g4: This file contains the same grammar as JavaParser.g4. The only difference is that rules with more than one alternative are labeled with specific names. The ANTLR parser generator thus generates separate visitor and listener methods for each alternative. This grammar facilitates the development of some refactorings. It is the preferred parser in the CodART project.

II. gen: The gen packages contain all generated source code for the parser, lexer, visitor, and listener for the different grammars available in the grammars directory. To develop refactorings and code smells, the gen.JavaLabled package, which contains the source code generated from JavaParserLabeled.g4, must be used. The content of this package is generated automatically and therefore should not be modified manually. Modules within this gen package are only meant to be imported and used in other modules.

III. speedy: The Python implementation of ANTLR is less efficient than the Java or C++ implementations. The speedy module implements a Java parser with a C++ back-end, improving the efficiency and speed of parsing. It uses the speedy-antlr implementation with some minor changes. The current version of the speedy module uses the Java9_v2.g4 grammar, which is inherently slow, as described above. To switch to the C++ back-end, the speedy module must first be installed on the client system, which requires a C++ compiler. We recommend that CodART developers use the Python back-end, since switching to the C++ back-end will be done transparently in a future release. The Python back-end saves debugging and development time.

IV. refactorings: The refactorings package is the main package in the CodART project and contains numerous Python modules that form the kernel functionalities of CodART. Each module implements the automation of one refactoring operation according to standard practices. The modules may include several classes that inherit from ANTLR listeners. Sub-packages of this package contain refactorings that are in an early stage of development or are deprecated versions of existing refactorings. This package is under active development and testing. The modules in the root package can be used for testing purposes.

V. refactoring_design_patters: The refactoring_design_pattern packages contain modules that implement refactoring to a specific design pattern automatically.

VI. smells: The smells package implements the automatic detection of software code and design smells relevant to the refactoring operations supported by CodART. Each smell corresponds to one or more refactorings in the refactorings package.

VII. metrics: The metrics package contains several modules that implement the computation of the most well-known source code metrics. These metrics are used to detect code smells and to measure the quality of software in terms of quality attributes.

VIII. tests: The test directory contains individual test data and test cases that are used for developing specific refactorings. Typically, each test case is a single Java file that contains one or more Java classes.

IX. benchmark_projects: This directory contains several open-source Java projects that have been used in prior research on automated refactoring. Once the implementation of a refactoring is completed, it will be executed and tested on all projects in this benchmark to ensure that the implementation generalizes.

X. Other packages: The information of other packages will be announced in the future.

5.2 Refactoring automation

Each refactoring operation in Table 1 is implemented as an API named after the refactoring. The API receives the involved entities with their refactoring roles and other required data as inputs, checks the feasibility of the refactoring using the refactoring preconditions described in [2], performs the refactoring if it is feasible, and returns the refactored code, or null if the refactoring is not feasible.
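The shape of such an API (roles in, precondition check, refactored result or null out) can be sketched on a toy program model. Here a program is just a mapping from class names to field names rather than a parse tree, the function name move_field and its signature are hypothetical, and the preconditions are simplified illustrations, not the complete set from [2].

```python
from typing import Dict, FrozenSet, Optional

# Toy program model (an assumption for this sketch): class name -> field names.
Program = Dict[str, FrozenSet[str]]


def move_field(program: Program, source_class: str, target_class: str,
               field: str) -> Optional[Program]:
    """Return the refactored model, or None if a precondition fails."""
    # Precondition 1: both classes exist and are distinct.
    if source_class not in program or target_class not in program:
        return None
    if source_class == target_class:
        return None
    # Precondition 2: the field exists in the source and does not clash in the target.
    if field not in program[source_class] or field in program[target_class]:
        return None
    # Apply the refactoring on a copy so that the input model stays untouched.
    refactored = dict(program)
    refactored[source_class] = program[source_class] - {field}
    refactored[target_class] = program[target_class] | {field}
    return refactored


model: Program = {"Order": frozenset({"id", "street"}), "Address": frozenset({"city"})}
print(move_field(model, "Order", "Address", "street"))  # feasible
print(move_field(model, "Order", "Address", "zip"))     # infeasible, prints None
```

Returning None instead of raising keeps infeasible requests cheap to discard, which matters when a search algorithm proposes many candidate refactorings.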

The core of our refactoring engine is a syntax-tree modification algorithm. Fundamentally, ANTLR is used to generate and modify the syntax tree of a given program. Each refactoring API is an ANTLR listener or visitor class that receives the required arguments through its constructor and performs the refactoring when invoked by a parse-tree walker object. The refactoring target and input parameters must be read from a configuration file, which can be expressed in JSON, XML, or YAML format.
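As an illustration of reading the refactoring target and parameters from a configuration file, the sketch below parses a JSON request into a small typed object. The key names (refactoring, target_file, roles) and the RefactoringRequest class are assumptions made for this example, not CodART's actual configuration schema.

```python
import json
from dataclasses import dataclass
from typing import Dict

REQUIRED_KEYS = {"refactoring", "target_file", "roles"}


@dataclass(frozen=True)
class RefactoringRequest:
    refactoring: str       # name of the refactoring to apply
    target_file: str       # Java file to refactor
    roles: Dict[str, str]  # role name -> entity name


def load_request(text: str) -> RefactoringRequest:
    """Parse a JSON document and check that all required keys are present."""
    data = json.loads(text)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return RefactoringRequest(data["refactoring"], data["target_file"],
                              dict(data["roles"]))


# Hypothetical configuration content.
EXAMPLE_CONFIG = """
{
  "refactoring": "encapsulate_field",
  "target_file": "A.java",
  "roles": {"source_class": "A", "field": "f"}
}
"""

request = load_request(EXAMPLE_CONFIG)
print(request.refactoring, request.roles["source_class"], request.roles["field"])
```

The values in roles could then be passed to the constructor of the listener that implements the named refactoring.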

The key to using ANTLR for refactoring tasks is the TokenStreamRewriter object that knows how to give altered views of a token stream without actually modifying the stream. It treats all of the manipulation methods as "instructions" and queues them up for lazy execution when traversing the token stream to render it back as text. The rewriter executes those instructions every time we call getText(). This strategy is very effective for the general problem of source code instrumentation or refactoring. The TokenStreamRewriter is a powerful and extremely efficient means of manipulating a token stream.
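The lazy "queue instructions, render on demand" strategy can be sketched in a few lines of plain Python. The class below is a deliberately simplified stand-in written for this explanation. It is not ANTLR's TokenStreamRewriter, and its method names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


class LazyTokenRewriter:
    """Queues edit instructions by token index; the tokens are never changed."""

    def __init__(self, tokens: Sequence[str]) -> None:
        self.tokens: Tuple[str, ...] = tuple(tokens)
        self._replacements: Dict[int, str] = {}
        self._insertions: Dict[int, List[str]] = {}

    def _check(self, index: int) -> None:
        if not 0 <= index < len(self.tokens):
            raise IndexError(f"token index {index} out of range")

    def replace(self, index: int, text: str) -> None:
        """Queue 'render `text` instead of token `index`'."""
        self._check(index)
        self._replacements[index] = text

    def insert_after(self, index: int, text: str) -> None:
        """Queue 'render `text` right after token `index`'."""
        self._check(index)
        self._insertions.setdefault(index, []).append(text)

    def get_text(self) -> str:
        """Execute the queued instructions while walking the tokens."""
        parts: List[str] = []
        for index, token in enumerate(self.tokens):
            parts.append(self._replacements.get(index, token))
            parts.extend(self._insertions.get(index, []))
        return "".join(parts)


rewriter = LazyTokenRewriter(["public", " ", "int", " ", "f", ";"])
rewriter.replace(0, "private")
rewriter.insert_after(5, " // encapsulated")
print(rewriter.get_text())        # altered view of the stream
print("".join(rewriter.tokens))   # original tokens are untouched
```

Because the instructions are keyed by token index, several listeners callbacks can queue edits in any order and the text is produced in a single pass.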

5.3 Refactoring recommendation

A solution consists of a sequence of n refactoring operations applied to different code elements of the source code to be fixed. In order to represent a candidate solution (individual/chromosome), we use a vector-based representation. Each vector’s dimension represents a refactoring operation, and the order in which these refactoring operations are applied corresponds to their positions in the vector. The initial population is generated by randomly assigning a sequence of refactorings to some code fragments. Each generated refactoring solution is executed on the software system S. Once all required data is computed, the solution is evaluated based on the quality of the resulting design.
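The vector-based representation and the random initial population might look like the following sketch. The class and function names, the operation names, and the target names are placeholders chosen for illustration. Preconditions and evaluation are left out.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RefactoringOperation:
    name: str    # e.g. "move_method"
    target: str  # code element the operation is applied to


# A solution is an ordered vector: position i is applied before position i + 1.
Solution = List[RefactoringOperation]


def random_solution(rng: random.Random, operations: Sequence[str],
                    targets: Sequence[str], length: int) -> Solution:
    """Assign randomly chosen operations to randomly chosen code elements."""
    return [RefactoringOperation(rng.choice(operations), rng.choice(targets))
            for _ in range(length)]


def initial_population(operations: Sequence[str], targets: Sequence[str],
                       size: int, max_length: int, seed: int = 0) -> List[Solution]:
    """Build `size` random solutions with lengths between 1 and `max_length`."""
    rng = random.Random(seed)
    return [random_solution(rng, operations, targets, rng.randint(1, max_length))
            for _ in range(size)]


OPERATIONS = ["move_method", "extract_class", "pull_up_field"]
TARGETS = ["Order.total", "Order", "Customer.name"]

for solution in initial_population(OPERATIONS, TARGETS, size=3, max_length=4):
    print([f"{op.name}({op.target})" for op in solution])
```

Variable-length vectors let the search decide how many refactorings a solution needs instead of fixing n in advance.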

6 Benchmark projects and testbed

To ensure CodART works properly, we run it on many real-life software projects. Refactorings are applied to the software systems listed in Table 3. The benchmark projects may be updated and extended in the future. For the time being, we use a set of well-known open-source Java projects that have been intensely studied in previous works. We have also added two new Java software programs, WEKA and ANTLR, to examine the versatility of CodART performance on real-life software projects.

Table 3. Software systems refactored in this project

System	Release	Previous releases	Domain	Reference
Xerces-J	v2.7.0	--	software packages for parsing XML	[3], [6]
Azureus	v2.3.0.6	--	Java BitTorrent client for handling multiple torrents	[3]
ArgoUML	v0.26 and v0.3	--	UML tool for object-oriented design	[3]
Apache Ant	v1.5.0 and v1.7.0	--	Java build tool and library	[3]
GanttProject	v1.10.2 and v1.11.1	--	project management	[3], [6], [5]
JHotDraw	v6.1 and v6.0b1 and v5.3	--	graphics tool	[6], [5], [4]
JFreeChart	v1.0.9	--	chart tool	[6]
Beaver	v0.9.11 and v0.9.8	--	parser generator	[5], [4]
Apache XML-RPC	v3.1.1	--	B2B communications	[5], [4]
JRDF	v0.3.4.3	--	semantic web (resource management)	[5]
XOM	v1.2.1	--	XML tool	[5]
JSON	v1.1	--	software packages for parsing JSON	[4]
JFlex	v1.4.1	--	lexical analyzer generator	[4]
Mango	v2.0.1	--	--	[4]
Weka	v3.9	--	data mining tool	New
ANTLR	v4.8.0	--	parser generator tool	New

7 CodART in IUST

Developing a comprehensive refactoring engine requires thousands of hours of programming. Refactoring is not just understanding and updating the syntax tree. The tool also needs to figure out how to render the code back into text in the editor view. As Fowler [2] puts it in his well-known refactoring book: “implementing decent refactoring is a challenging programming exercise—one that I’m mostly unaware of as I gaily use the tools.”

We have defined the basic functionalities of the CodART system as several student projects with different proposals. Students who take our computer science courses, including compiler design and construction, advanced compilers, and advanced software engineering, must work on these proposals as part of their coursework. These projects aim to familiarize students with the practical usage of compilers from the software engineering point of view. Detailed information about our current proposals is available at the following links:

Core refactoring operations development (Fall 2020)
Core code smells development (Current semester, Winter and Spring 2021)
Core search-based development (Future semesters)
Core refactoring to design patterns development (Future semesters)

Note: Students whose final project is confirmed by the reverse engineering laboratory have an opportunity to work on CodART as an independent and advanced research project. The only prerequisite is to pass the compiler graduate course by Dr. Saeed Parsa.

8 Conclusion and remarks

Software refactoring is used to reduce the costs and risks of software evolution. Automated software refactoring tools can reduce risks caused by manual refactoring, improve efficiency, and reduce software refactoring difficulties. Researchers have made great efforts to research how to implement and improve automated software refactoring tools. However, the results of automated refactoring tools often deviate from the intentions of the implementer. The goal of this project is to propose an open-source refactoring engine and toolkit that can automatically find the best refactoring sequence required for a given software and apply this sequence. Since the tool is based on compiler principles, it is reliable enough to be used in practice and offers many benefits to software development companies. Students who participate in the project will learn compiler techniques such as lexing, parsing, source code analysis, and source code transformation. They will also learn about software refactoring, search-based software engineering, optimization, software quality, and object-oriented metrics.

Conflict of interest

The project is supported by the IUST Reverse Engineering Research Laboratory. Interested students may continue working on this project to fulfill their final bachelor and master thesis or their internship.

References

[1] T. Parr and K. Fisher, “LL(*): the foundation of the ANTLR parser generator,” Proc. 32nd ACM SIGPLAN Conf. Program. Lang. Des. Implement., pp. 425–436, 2011.

[2] M. Fowler and K. Beck, Refactoring: improving the design of existing code, 2nd ed. Addison-Wesley, 2018.

[3] M. W. Mkaouer, M. Kessentini, S. Bechikh, M. Ó Cinnéide, and K. Deb, “On the use of many quality attributes for software refactoring: a many-objective search-based software engineering approach,” Empir. Softw. Eng., vol. 21, no. 6, pp. 2503–2545, Dec. 2016.

[4] M. Mohan, D. Greer, and P. McMullan, “Technical debt reduction using search based automated refactoring,” J. Syst. Softw., vol. 120, pp. 183–194, Oct. 2016.

[5] M. Mohan and D. Greer, “Using a many-objective approach to investigate automated refactoring,” Inf. Softw. Technol., vol. 112, pp. 83–101, Aug. 2019.

[6] W. Mkaouer et al., “Many-Objective Software Remodularization Using NSGA-III,” ACM Trans. Softw. Eng. Methodol., vol. 24, no. 3, pp. 1–45, May 2015.

[7] M. Mohan and D. Greer, “MultiRefactor: automated refactoring to improve software quality,” 2017, pp. 556–572.

[8] N. Tsantalis, T. Chaikalis, and A. Chatzigeorgiou, “Ten years of JDeodorant: lessons learned from the hunt for smells,” in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 2018, pp. 4–14.

[9] T. Mens and T. Tourwe, “A survey of software refactoring,” IEEE Trans. Softw. Eng., vol. 30, no. 2, pp. 126–139, Feb. 2004.

Endnotes

[1] https://www.jetbrains.com/idea/

[2] http://www.eclipse.org

[3] http://www.netbeans.org

[4] https://github.com/mmohan01/MultiRefactor

[5] http://sourceforge.net/projects/recoder

[6] http://reverse.iust.ac.ir

Automated refactoring of the Java code using ANTLR in Python

2022-05-02T00:30:00+04:30

Introduction

Refactoring is a type of program transformation that preserves the program’s behavior. The goal of refactoring is to improve the program’s internal structure without changing its external behavior. In this way, the program quality, defined and measured in terms of quality attributes, is improved. Researchers have recently studied the improvement of different quality attributes through refactoring (Mkaouer et al. 2016; Mohan and Greer 2019).

The refactoring process could be automated to reduce the required time and cost and increase the reliability of applied transformation. Refactoring engines are tools that automate the application of refactorings: first, the user chooses a refactoring to apply, then the engine checks if the transformation is safe and, if so, transforms the program. Refactoring engines are a key component of modern Integrated Development Environments (IDEs), and programmers rely on them to perform refactorings. The programmer need only select which refactoring to apply, and the engine will automatically check the preconditions and apply the transformations across the entire program if the preconditions are satisfied. Refactoring is gaining popularity, as evidenced by the inclusion of refactoring engines in modern IDEs such as IntelliJ IDEA, Eclipse, or NetBeans for Java.

According to Fowler (Fowler and Beck 2018), the biggest change to refactoring in the last decade is the availability of tools that support automated refactoring. Refactoring engines must be reliable. A fault in a refactoring engine can silently introduce bugs in the refactored program and lead to challenging debugging sessions. If the original program compiles, but the refactored program does not, the refactoring is obviously incorrect and can be easily undone. However, if the refactoring engine erroneously produces a refactored program that compiles but does not preserve the semantics of the original program, this can have severe consequences. Therefore, an automated refactoring tool must operate on the code’s syntax tree, not the text, to perform refactoring correctly. Manipulating the syntax tree is more reliable for preserving the program syntax, semantics, and behavior. For this reason, developing an automated refactoring tool requires deep knowledge of compiler techniques. Fowler (Fowler and Beck 2018) presents a catalog of more than 70 refactorings in his book and states that

implementing decent refactoring is a challenging programming exercise—one that I am mostly unaware of as I gaily use the tools.

Martin Fowler

ANTLR Background

Before reading this tutorial, I recommend looking at the ANTLR basic tutorial, where I describe the background of using ANTLR to generate and walk parse trees and to implement custom program analysis applications with the help of the ANTLR listener mechanism. The most important point is that we used grammars of real-world programming languages to show the parsing and analysis process. The discussed approach forms the underlying concepts of our approach to automated refactoring. Indeed, we implement appropriate listeners that perform the actions required to apply each refactoring. ANTLR provides the TokenStreamRewriter class, which can manipulate program tokens at specific indices in the program.

Using ANTLR for automating refactoring

The key to using ANTLR for refactoring tasks is the TokenStreamRewriter class that knows how to give altered views of a token stream without actually modifying the stream. It treats all of the manipulation methods as “instructions” and queues them up for lazy execution when traversing the token stream to render it back as text. The rewriter executes those instructions every time we call the getText() method. This strategy is very effective for the general problem of source code instrumentation or refactoring. The TokenStreamRewriter is a powerful and extremely efficient means of manipulating a token stream.

In the remaining sections of this post, I discuss different refactoring techniques and describe the automation of the most important refactoring operations in Python based on the ANTLR library.

Please note that the full implementation of any refactoring operation contains many details that are too complicated to describe here. Therefore, I focus only on the most important parts of the automation process in this chapter. For interested readers, the full implementation of the discussed refactorings can be found at https://github.com/m-zakeri/CodART.

CodART is our recently developed open-source refactoring engine that has automated the application of 16 refactoring operations with ANTLR. The up-to-date documentation of CodART is available at https://m-zakeri.github.io/CodART/. Each refactoring operation has a definition and is clearly specified by the entities it involves and the role of each. I suggest looking at CodART's white paper for a decent introduction to refactoring operations.

Encapsulate field

We begin with a simple yet important refactoring, encapsulate field, which provides information hiding, one of the basic principles of object-oriented design (Booch et al. 2008). The encapsulate field refactoring replaces all references to a field with accesses through setter and getter methods. This refactoring takes as input the name of the field to encapsulate and the name of its enclosing class. It performs the following transformations:

Creates a public getter method that returns the field’s value,
Creates a public setter method that updates the field’s value to a given parameter’s value,
Replaces all field reads with calls to the getter method,
Replaces all field writes with calls to the setter method,
Changes the field’s access modifier to private.

Figure 1 shows an example of encapsulate field refactoring for field f in class A.

Figure 1: Example EncapsulateField refactoring
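The first two transformations amount to generating Java source text for the accessor methods. A minimal sketch of that text generation follows. The helper names are hypothetical, the getX/setX naming follows the common JavaBeans convention, and the sketch assumes a non-static field.

```python
from typing import Tuple


def accessor_names(field: str) -> Tuple[str, str]:
    """Derive getter and setter names from a field name (f -> getF, setF)."""
    if not field:
        raise ValueError("field name must not be empty")
    suffix = field[0].upper() + field[1:]
    return f"get{suffix}", f"set{suffix}"


def accessor_source(field_type: str, field: str, indent: str = "    ") -> str:
    """Build the Java source of a public getter and setter for the field."""
    getter, setter = accessor_names(field)
    body = indent * 2
    return (
        f"{indent}public {field_type} {getter}() {{\n"
        f"{body}return this.{field};\n"
        f"{indent}}}\n"
        "\n"
        f"{indent}public void {setter}({field_type} {field}) {{\n"
        f"{body}this.{field} = {field};\n"
        f"{indent}}}\n"
    )


print(accessor_source("int", "f"))
```

A listener can insert this text into the class body and use the derived names when replacing field reads and writes.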

To perform this refactoring automatically, we develop a listener class, EncapsulateFiledRefactoringListener, that implements the aforementioned transformations. The constructor of this class is shown below. The class takes an instance of the CommonTokenStream class, the source class name, and the field identifier as input. The first parameter is used to initialize an instance of the TokenStreamRewriter class, which provides a set of methods to manipulate the tokens of the syntax (parse) tree. The second and third parameters specify the entity to be refactored.

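from antlr4 import CommonTokenStream
from antlr4.TokenStreamRewriter import TokenStreamRewriter

# JavaParserLabeled and JavaParserLabeledListener are generated by ANTLR
# from JavaParserLabeled.g4; import them from the generated gen package.

__version__ = '0.1.0'
__author__ = 'Morteza Zakeri'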

class EncapsulateFiledRefactoringListener(JavaParserLabeledListener):
    """
    To implement the encapsulate field refactoring
    make a public field private and provide 
    accessors and mutator methods.
    """

    def __init__(self, common_token_stream: CommonTokenStream = None,
                 source_class_name: str = None,
                 field_identifier: str = None):
        """
        :param common_token_stream: contains the program tokens
        :param source_class_name: contains the enclosing class of the field
        :param field_identifier: the field name to be encapsulated 
        """
        self.token_stream = common_token_stream
        self.source_class_name = source_class_name
        self.field_identifier = field_identifier
        self.in_source_class = False

        # Move all the tokens in the source code in a buffer,
        # token_stream_rewriter.
        if common_token_stream is not None:
            self.token_stream_rewriter = \
                TokenStreamRewriter(common_token_stream)
        else:
            raise TypeError('common_token_stream is None')
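The role of the rewriter can be pictured with a much smaller stand-in. The sketch below is not ANTLR's TokenStreamRewriter; TinyRewriter and its method names are hypothetical and only illustrate the idea that edits are recorded against token indices and applied when the text is rendered, while the original tokens stay untouched.

```python
class TinyRewriter:
    """Simplified, hypothetical stand-in for a token stream rewriter."""

    def __init__(self, tokens):
        self.tokens = list(tokens)   # the original tokens are never modified
        self.replacements = {}       # start index -> (stop index, new text)
        self.insertions = {}         # token index -> text emitted after it

    def replace_range(self, from_idx, to_idx, text):
        self.replacements[from_idx] = (to_idx, text)

    def insert_after(self, index, text):
        self.insertions[index] = self.insertions.get(index, '') + text

    def get_text(self):
        out, i = [], 0
        while i < len(self.tokens):
            if i in self.replacements:
                last, text = self.replacements[i]
                out.append(text)     # emit the replacement, skip tokens i..last
            else:
                last = i
                out.append(self.tokens[i])
            out.append(self.insertions.get(last, ''))
            i = last + 1
        return ''.join(out)


rewriter = TinyRewriter(['public', ' ', 'int', ' ', 'f', ';'])
rewriter.replace_range(0, 0, 'private')
rewriter.insert_after(5, '\n// accessors go here')
rendered = rewriter.get_text()
```

Because the edits are applied lazily, several listener methods can register changes during a single walk without invalidating each other's token indices.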

The entire refactoring is performed in four steps. First, we should check whether the parse tree walker is visiting the class that contains the given field. We use a flag variable, in_source_class, to indicate that the walker has entered the source class. This flag is set to true when the enterClassDeclaration method is called and is set back to false when the exitClassDeclaration method is called:

def enterClassDeclaration(self, ctx: JavaParserLabeled.ClassDeclarationContext):
    if ctx.IDENTIFIER().getText() == self.source_class_name:
        self.in_source_class = True

def exitClassDeclaration(self, ctx: JavaParserLabeled.ClassDeclarationContext):
    self.in_source_class = False

The second step is to change the access modifier of the field from public to private. We could perform this either when entering or when exiting the fieldDeclaration rule in the Java grammar. We must ensure that we modify the given field, not other fields in the class or program. The first and second if statements in the following code perform this check. Afterward, the replaceRange method of the token_stream_rewriter is called to replace the “public” modifier token with the “private” modifier token, as shown in the following code snippet.

def enterFieldDeclaration(self, ctx: JavaParserLabeled.FieldDeclarationContext):
    if self.in_source_class:
        if ctx.variableDeclarators().variableDeclarator(
                0).variableDeclaratorId().getText() == self.field_identifier:
            if ctx.parentCtx.parentCtx.modifier(0).getText() == 'public':
                self.token_stream_rewriter.replaceRange(
                  from_idx=ctx.parentCtx.parentCtx.modifier(0).start.tokenIndex,
                  to_idx=ctx.parentCtx.parentCtx.modifier(0).stop.tokenIndex,
                  text='private')
            else:
                return

Figure 2 shows the part of the parse tree generated for the code snippet in Figure 1. The parse tree visualization helps to understand the logic behind the implementation of the enterFieldDeclaration method in the above code. Indeed, we write a fragment of code by observing the position of nodes in the corresponding parse tree. For example, when we enter the fieldDeclaration rule (i.e., when ANTLR calls the above method), the ANTLR runtime library provides a ctx object of class FieldDeclarationContext, which contains pointers to the parent and children of the FieldDeclaration node. These pointers allow us to move between the different nodes in the parse tree, typically around our main node, which is FieldDeclaration in our example.

Figure 2: Part of the parse tree generated for the code snippet in Figure 1 (left).

Our goal is to change the “public” token to “private”. This token is the direct child of the classOrInterfaceModifier node and a descendant of the modifier node. Therefore, we should move from the FieldDeclaration node to one of the modifier or classOrInterfaceModifier nodes. The statement ctx.parentCtx.parentCtx.modifier(0) gives us the first child of the modifier node, i.e., classOrInterfaceModifier. The green arrow in Figure 2 shows how we can move from the FieldDeclaration node to the classOrInterfaceModifier node.

In the third step, we add the getter and setter methods for the encapsulated field to the class body. To this end, we define the new_code variable, which holds the generated code. The code can be generated from a simple template that accessor and mutator methods typically follow. When the generated code is complete, it is added to the class body after the encapsulated field declaration using the insertAfter method of the token_stream_rewriter object. We place these actions in the exitFieldDeclaration method. The following code snippet shows how the accessor and mutator methods are generated and inserted.

def exitFieldDeclaration(self, ctx: JavaParserLabeled.FieldDeclarationContext):
    if not self.in_source_class:
        return
    if ctx.variableDeclarators().variableDeclarator(
            0).variableDeclaratorId().getText() != self.field_identifier:
        return

    # Check if getter or setter methods already exist
    getter_name = 'get' + str.capitalize(self.field_identifier)
    setter_name = 'set' + str.capitalize(self.field_identifier)
    self.getter_exist = False
    self.setter_exist = False
    for c in ctx.parentCtx.parentCtx.parentCtx.classBodyDeclaration():
        try:
            method_name = c.memberDeclaration().methodDeclaration() \
                .IDENTIFIER().getText()
        except AttributeError:
            continue  # this class member is not a method declaration
        if method_name == getter_name:
            self.getter_exist = True
        if method_name == setter_name:
            self.setter_exist = True

    # Generate accessor and mutator methods if they do not exist
    # Accessor body
    new_code = ''
    if not self.getter_exist:
        new_code = '\n\t// new getter method\n\t'
        new_code += 'public ' + ctx.typeType().getText() + \
                    ' get' + str.capitalize(self.field_identifier)
        new_code += '() { \n\t\treturn this.' + self.field_identifier \
                    + ';' + '\n\t}\n'

    # Mutator body
    if not self.setter_exist:
        new_code += '\n\t// new setter method\n\t'
        new_code += 'public void set' + \
                    str.capitalize(self.field_identifier)
        new_code += '(' + ctx.typeType().getText() + ' ' \
                    + self.field_identifier + ') { \n\t\t'
        new_code += 'this.' + self.field_identifier + ' = ' \
                    + self.field_identifier + ';' + '\n\t}\n'
    self.token_stream_rewriter.insertAfter(ctx.stop.tokenIndex, new_code)

The above code first checks whether a getter and a setter already exist and then adds any missing accessor or mutator method to the class body. The fourth step is to update the references to the encapsulated field, replacing each field usage with a call to the appropriate setter or getter method. Different rules in the grammar describe access to class fields, and a complete encapsulate field refactoring should consider all of them. However, the code for updating the field usages with getter and setter methods is almost the same in each case. For example, by looking at Figure 3, which is part of the parse tree for the code shown in Figure 1, we can see that when the right-hand sibling of node expression1 is a binary operator such as MUL, the child of expression1 must be replaced with a call to the getter method.

Figure 3: Part of the parse tree generated for the code snippet in Figure 1 (left)

The following code snippet performs this transformation when the walker exits from the expression1 node.

def exitExpression1(self, ctx: JavaParserLabeled.Expression1Context):
    if not self.in_source_class:
        return
    # The left-hand side of an assignment is a write, not a read, so skip it
    parent = ctx.parentCtx
    if parent is not None and parent.getChildCount() > 1 \
            and parent.getChild(0) is ctx \
            and parent.getChild(1).getText() in (
                '=', '+=', '-=', '*=', '/=', '&=', '|=', '^=', '>>=',
                '>>>=', '<<=', '%='):
        return
    if ctx.getText() == 'this.' + self.field_identifier:
        new_code = 'this.get' + str.capitalize(self.field_identifier) + '()'
        self.token_stream_rewriter.replaceRange(ctx.start.tokenIndex,
                                                ctx.stop.tokenIndex,
                                                new_code)
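The decision made at the top of exitExpression1 can be isolated as a small predicate. The sketch below is a simplification that works on the texts of the parent's children rather than on ANTLR context objects; is_write_target, rewrite_read, and their parameters are hypothetical names introduced only for illustration.

```python
ASSIGNMENT_OPERATORS = ('=', '+=', '-=', '*=', '/=', '&=', '|=', '^=',
                        '>>=', '>>>=', '<<=', '%=')


def is_write_target(parent_children, position):
    """True if the child at `position` is the left-hand side of an assignment."""
    return (position == 0
            and len(parent_children) > 1
            and parent_children[1] in ASSIGNMENT_OPERATORS)


def rewrite_read(expression_text, field_identifier):
    """Replace a read of this.<field> with a getter call; keep other text."""
    if expression_text == 'this.' + field_identifier:
        return 'this.get' + str.capitalize(field_identifier) + '()'
    return expression_text


# this.f * 2  -> the field is read, so it becomes a getter call
read_case = rewrite_read('this.f', 'f')
# this.f = 5  -> the field is written, so the getter rewrite must be skipped
write_case = is_write_target(['this.f', '=', '5'], 0)
```

Note that a compound assignment such as this.f += 1 both reads and writes the field, so a complete implementation would probably need to combine the getter and the setter for that case.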

Other places where the encapsulated field is accessed or modified should be found and updated in a way similar to the one described in this step. Once all four steps described in this section are applied, the code snippet shown in Figure 1 (left) is transformed into the code snippet shown in Figure 1 (right), and the encapsulate field refactoring is complete.
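The accessor template used in the third step can also be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper named make_accessors that is not part of CodART. It departs from the listing in one detail: it upper-cases only the first letter of the field name, because str.capitalize also lower-cases the remaining characters (for example, 'myField' becomes 'Myfield').

```python
def make_accessors(field_type, field_name,
                   getter_exists=False, setter_exists=False):
    """Build Java getter/setter source text for one field (illustrative only)."""
    cap = field_name[:1].upper() + field_name[1:]
    code = ''
    if not getter_exists:
        code += '\n\t// new getter method\n\t'
        code += 'public ' + field_type + ' get' + cap
        code += '() {\n\t\treturn this.' + field_name + ';\n\t}\n'
    if not setter_exists:
        code += '\n\t// new setter method\n\t'
        code += 'public void set' + cap
        code += '(' + field_type + ' ' + field_name + ') {\n\t\t'
        code += 'this.' + field_name + ' = ' + field_name + ';\n\t}\n'
    return code


generated = make_accessors('int', 'f')
```

Calling make_accessors with both flags set to True returns an empty string, which mirrors the case in which the class already provides both methods.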

Conclusion and remarks

Most of the techniques described in this section can be used to automate other refactoring operations. The only differences are the required actions, which are often unique to each refactoring. The overall process consists of looking at the relevant parts of the parse tree, choosing a suitable node, and implementing the required actions.

The ctx object of the Context class contains all the information we need to find, check, or change when performing the refactoring. In addition, visualizing the parse tree helps decide which node should host which actions and how the actions should be programmed.

It should be noted that selecting a parse tree node (or grammar rule) in which to place the required actions does not have a unique and deterministic answer. In other words, when programming with ANTLR, we can place our actions in any of several nodes. For example, to change the “public” token to a “private” token, one may place the required actions in the memberDeclaration node, which slightly changes our code above. The node that minimizes the implementation effort of the actions should be chosen. As general advice, when automating refactoring operations, we write our actions on a node near the refactoring entities.

I will try to add the automation of more refactoring operations to this tutorial.

Stay hungry, stay incomplete :)

References

Booch G, Maksimchuk RA, Engle MW, et al (2008) Object-oriented analysis and design with applications, third edition. ACM SIGSOFT Softw Eng Notes 33:29–29. https://doi.org/10.1145/1402521.1413138

Fowler M, Beck K (2018) Refactoring: improving the design of existing code, 2nd edn. Addison-Wesley

Do software engineers sacrifice themselves?

2021-04-05T12:00:00+04:30

I am eager to know any related comments in response to the following questions:

(Q1) What are the negative impacts and the dark side of agile developments on software engineers’ life?
(Q2) Does agile software development put too much pressure on software engineers and developers?
(Q3) What are the equivalents of agile methods in other engineering and science disciplines, such as civil engineering, chemical engineering, and medical science?
(Q4) If there is any equivalent, then how is its popularity and acceptance among the experts in that field?
(Q5) Should software developers refuse to work for employers that enforce agile methodologies?

Please also help me refine the topic and questions to make something useful for the software engineers’ community.

Dark Sides of the Agile Software Development Culture

Agile software development methodologies, along with DevOps and Continuous Integration/Continuous Deployment (CI/CD), have revolutionized the software industry. Their advantages, such as flexibility, faster delivery, and enhanced collaboration, are widely celebrated. However, beneath this success lies a less-discussed reality: the negative impacts and challenges these methodologies impose on software engineers. This article critically examines the "dark corners" of agile practices and their implications for the software engineering community.

The Negative Impacts of Agile Development on Software Engineers' Lives

While agile methodologies emphasize adaptability and collaboration, they often lead to unintended consequences:
- Burnout and Stress: The iterative nature of agile, with its constant sprints and deadlines, can create a high-pressure environment. Engineers may feel overwhelmed by the relentless pace and the expectation to deliver continuously.
- Scope Creep: Agile's flexibility can result in frequent changes to project requirements, leading to extended work hours and frustration.
- Erosion of Work-Life Balance: The demand for constant availability and rapid responses to changes can blur the boundaries between personal and professional life.
- Reduced Creativity: The focus on delivering "working software" in short cycles may stifle innovation, as engineers prioritize immediate goals over long-term vision.

Does Agile Put Too Much Pressure on Developers?

Agile methodologies can indeed exert significant pressure on developers:
- Effort-Reward Imbalance: While agile practices may enhance job satisfaction, they can also increase stress, particularly when developers perceive an imbalance between their efforts and rewards.
- High Expectations: Developers often face tight deadlines and are expected to integrate new tools and practices without compromising delivery.
- Team Dynamics: Agile's reliance on collaboration can be challenging in teams with varying skill levels or communication styles.

Equivalents of Agile Methods in Other Disciplines

Agile principles have inspired similar approaches in other fields:
- Civil Engineering: Lean construction methods, which emphasize waste reduction and continuous improvement, share similarities with agile.
- Medical Science: Agile-like frameworks are used in clinical trials to adapt to new findings and improve patient outcomes.
- Chemical Engineering: Iterative design processes in product development mirror agile's incremental approach.

Popularity and Acceptance in Other Fields

The adoption of agile-like methodologies varies across disciplines:
- Civil Engineering: Lean construction is gaining traction but faces resistance due to the industry's traditional mindset.
- Medical Science: Agile frameworks are well-received in research settings but are less common in clinical practice.
- Chemical Engineering: Iterative methods are widely accepted in R&D but are less prevalent in large-scale production.

Should Developers Refuse Agile Workplaces?

While refusing to work in agile environments may not be practical, developers can advocate for healthier practices:
- Promote Sustainable Pace: Encourage employers to prioritize work-life balance and avoid overloading teams.
- Seek Transparency: Push for clear communication about project goals and expectations.
- Foster Collaboration: Advocate for inclusive team dynamics that respect diverse perspectives.

Program dynamic analysis with ANTLR

2021-03-30T23:45:00+04:30

Introduction

In this tutorial we describe a primary task in source code transformation, i.e., program instrumentation, which is one of the CodA features.

The task can be performed by properly applying compiler techniques to add the required code snippets at specific places in the source code. Instrumentation is the fundamental prerequisite for almost all types of dynamic analysis. Let us begin with a simple case in which the purpose of instrumentation is to log the executed path of the program control flow graph for each execution. Consider the following C++ program used to calculate the greatest common divisor (GCD) of two integers:

#include <stdio.h>
#include <iostream>

int main()
{
    int num1, num2, i, gcd;
    std::cout << "Enter two integers: ";
    std::cin >> num1 >> num2;
    for(i=1; i <= num1 && i <= num2; ++i)
    {
        // Checks if i is factor of both integers
        if(num1%i==0 && num2%i==0) 
            gcd = i;
    }
    std::cout << "G.C.D is " << gcd << std::endl;
    return 0;
}

Figure 1. Source code of GCD program.

Appropriate instrumentation puts a log statement at the beginning of each basic block. For simplicity, we add a print statement that writes the number of the executed basic block to the console. In the GCD program, lines 6, 10, and 13 show the starting points of basic blocks. Therefore, the instrumented version of the GCD program is similar to the following code, in which the print statements have been added manually:

#include <stdio.h>
#include <iostream>

#include <fstream>
std::ofstream logFile("log_file.txt");

int main()
{
    logFile << "p1" << std::endl;
    int num1, num2, i, gcd;
    std::cout << "Enter two integers: ";
    std::cin >> num1 >> num2;
    for(i=1; i <= num1 && i <= num2; ++i)
    {
        logFile << "p2" << std::endl;

        // Checks if i is factor of both integers
        if(num1%i==0 && num2%i==0)
        {
            logFile << "p3" << std::endl;
            gcd=i;
        }
        //continue;
    }
    std::cout << "G.C.D is " << gcd << std::endl;
    //return 0;

    logFile << "p4" << std::endl;
}

Figure 2. Source code of GCD program after instrumenting.

In Figure 2, the log statements are added at the beginning of each basic block. For large programs, adding such statements manually is impractical. To perform this instrumentation with ANTLR, we just need to identify the conditional statements, including if statements, loop statements, and switch-case statements. In addition, the beginning of each function should be recognized. ANTLR provides a listener interface that consists of an enter method and an exit method for each non-terminal in the target language grammar. The listener can be passed to the parse tree walker, which traverses the parse tree in depth-first order (DFS). For instrumenting, we must implement the methods of the listener interface related to the conditional rules. The implementation of the listener in Python is shown in the following:

from antlr4 import CommonTokenStream, TokenStreamRewriter

# CPP14Listener and CPP14Parser are the classes generated by ANTLR from the
# C++14 grammar; import them from wherever they are generated in your project.


class InstrumentationListener(CPP14Listener):
    def __init__(self, tokenized_source_code: CommonTokenStream):
        self.branch_number = 0
        if tokenized_source_code is not None:
            # Move all the tokens in the source code into a buffer, token_stream_rewriter.
            self.token_stream_rewriter = TokenStreamRewriter.TokenStreamRewriter(tokenized_source_code)
        else:
            raise Exception('tokenized_source_code is None')

    # Declare and open a text file for logging at the beginning of the program
    def enterTranslationunit(self, ctx: CPP14Parser.TranslationunitContext):
        new_code = '\n #include <fstream> \n std::ofstream logFile("log_file.txt"); \n'
        # Insert before the first token so that the declaration precedes all code
        self.token_stream_rewriter.insertBeforeIndex(ctx.start.tokenIndex, new_code)

    # Called during the DFS traversal for each statement subtree rooted at ctx;
    # if the statement is the body of a branching condition, insert a probe.
    def enterStatement(self, ctx: CPP14Parser.StatementContext):
        if isinstance(ctx.parentCtx, (CPP14Parser.SelectionstatementContext,
                                      CPP14Parser.IterationstatementContext)):
            # If there is a compound statement after the branching condition:
            if isinstance(ctx.children[0], CPP14Parser.CompoundstatementContext):
                self.branch_number += 1
                new_code = '\n logFile << "p' + str(self.branch_number) + '" << std::endl; \n'
                self.token_stream_rewriter.insertAfter(ctx.start.tokenIndex, new_code)
            # If there is only one statement after the branching condition, create a block.
            elif not isinstance(ctx.children[0],
                                (CPP14Parser.SelectionstatementContext, CPP14Parser.IterationstatementContext)):
                self.branch_number += 1
                new_code = '{'
                new_code += '\n logFile << "p' + str(self.branch_number) + '" << std::endl; \n'
                # Note: getText() joins the token texts without the hidden whitespace
                new_code += ctx.getText()
                new_code += '\n}'
                self.token_stream_rewriter.replaceRange(ctx.start.tokenIndex, ctx.stop.tokenIndex, new_code)

    def enterFunctionbody(self, ctx: CPP14Parser.FunctionbodyContext):
        self.branch_number += 1
        new_code = '\n logFile << "p' + str(self.branch_number) + '" << std::endl;\n'
        self.token_stream_rewriter.insertAfter(ctx.start.tokenIndex, new_code)

Figure 3. ANTLR listener for instrumenting.

In the above code, the class InstrumentationListener extends CPP14Listener, the base listener that ANTLR generates for the C++ grammar. Note that the grammar of C++14 is available on the official ANTLR website. The two methods enterStatement() and enterFunctionbody() are implemented to add a print statement at the proper places in the program code, namely at the beginning of the body of each conditional or loop statement and at the beginning of each function, respectively. These two methods are invoked by the ANTLR parse tree walker if we pass an instance of InstrumentationListener to it.
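To picture how the walker drives these callbacks, the following is a minimal sketch of a depth-first walker written without ANTLR. Node, walk, and ProbeCounter are hypothetical names used only for illustration; ANTLR's own ParseTreeWalker is more elaborate, but the order in which enter and exit methods fire follows the same idea.

```python
class Node:
    """A toy parse tree node identified by its grammar rule name."""

    def __init__(self, rule, children=()):
        self.rule = rule
        self.children = list(children)


def walk(listener, node):
    """Depth-first traversal that calls enter<Rule>/exit<Rule> if defined."""
    enter = getattr(listener, 'enter' + node.rule, None)
    if enter is not None:
        enter(node)
    for child in node.children:
        walk(listener, child)
    leave = getattr(listener, 'exit' + node.rule, None)
    if leave is not None:
        leave(node)


class ProbeCounter:
    """Records which callbacks fire, in order."""

    def __init__(self):
        self.visited = []

    def enterFunctionbody(self, node):
        self.visited.append('enter Functionbody')

    def enterStatement(self, node):
        self.visited.append('enter Statement')

    def exitFunctionbody(self, node):
        self.visited.append('exit Functionbody')


tree = Node('Functionbody', [Node('Statement'),
                             Node('Statement', [Node('Statement')])])
counter = ProbeCounter()
walk(counter, tree)
```

Only the methods a listener defines are called, which is why the instrumentation listener needs to implement just a handful of them.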

The InstrumentationListener class also has two attributes: branch_number and token_stream_rewriter. branch_number is used to track the number of instrumented blocks during the instrumentation. Each time we add a print statement, we increment the value of branch_number by one.

The constructor defines branch_number and initializes it to zero. The token_stream_rewriter object is an instance of the TokenStreamRewriter class, which is provided by ANTLR and wraps the stream of source code tokens. TokenStreamRewriter is initialized with the common token stream, which ANTLR has already built from the lexer, and provides methods for adding and manipulating code snippets within that stream. The constructor creates an instance of the TokenStreamRewriter class to access these methods. If the token stream passed to the constructor is None, an exception is raised.

Let us explain the logic of enterFunctionbody() first, as it is simpler than enterStatement(). This method is invoked each time a function definition occurs in the source code. First, branch_number is increased by 1. Next, the print statement, including branch_number, is prepared. Finally, we tell token_stream_rewriter to insert this new code after the first token of the function body, i.e., { in C++.

Adding print statements after conditional and loop statements requires more effort. enterStatement() is invoked each time a statement node is visited. Its first if statement checks whether the parent of the statement is an instance of SelectionstatementContext or IterationstatementContext, which are the rule contexts for conditional and loop statements in the C++ grammar. If this condition does not hold, i.e., for regular statements, no action is performed. Otherwise, we face two different situations. The first one is that the body of the conditional or loop statement is a compound statement, i.e., a sequence of statements enclosed between two braces, { and }.

In such a case, we just need to add our print statement at the beginning of the compound statement, right after the { token. The code for this case is exactly the same as the code used in enterFunctionbody(). The second situation occurs when the conditional or loop statement has only one statement, without braces, as its body. In this case, the compiler considers only the first statement to be within the condition or loop. If one adds a print statement without any enclosing braces, the execution path will not be captured correctly. Hence, in the elif branch, after detecting that the statement is neither a compound statement nor a branch, the proper code is generated. The required code for instrumenting consists of a left brace, a print statement, the current statement (context), and, at the end, a right brace.

The replaceRange call at the end of enterStatement() adds new_code to the current source code. The implementation of our InstrumentationListener is now finished. The next step is to write the main driver for the instrumentation tool and connect this listener to the parse tree walker.
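The brace-wrapping rule for single-statement bodies can be checked in isolation before running the whole tool. The sketch below is only an illustration: probe and wrap_single_statement are hypothetical helpers that build the same kind of text as the listener, and the probe uses std::endl so that the generated line compiles without a using directive.

```python
def probe(block_number):
    """Text of the log statement for one basic block."""
    return '\n logFile << "p' + str(block_number) + '" << std::endl; \n'


def wrap_single_statement(statement_text, block_number):
    """Enclose a brace-less body in a block that starts with a probe."""
    return '{' + probe(block_number) + statement_text + '\n}'


# Body of `if (num1 % i == 0 && num2 % i == 0) gcd = i;`
wrapped = wrap_single_statement('gcd = i;', 3)
```

Without the added braces, only the probe would be governed by the condition and the original statement would run unconditionally, which would change the program's behavior.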

Figure 4 shows the body of the main Python script required to create and run our efficient yet straightforward instrumenting tool. A comment explains each step of the code, and therefore we omit extra descriptions. The only important note is that the instrumented code, i.e., the modified source code, is accessible through the token_stream_rewriter object. The getDefaultText() method of the token_stream_rewriter object is called in the last step to retrieve the new source code.

from antlr4 import *

# CPP14Lexer and CPP14Parser are generated by ANTLR from the C++14 grammar;
# input_string holds the source code of the program to be instrumented.

# Step 1: Convert input to a character stream
stream = InputStream(input_string)

# Step 2: Create lexer
lexer = CPP14Lexer(stream)

# Step 3: Create a list of tokens
token_stream = CommonTokenStream(lexer)

# Step 4: Create parser
parser = CPP14Parser(token_stream)

# Step 5: Create parse tree
parse_tree = parser.translationunit()

# Step 6: Create the listener
instrument_listener = InstrumentationListener(token_stream)

# Step 7: Create parse tree walker
walker = ParseTreeWalker()

# Step 8: Walk the parse tree with the listener attached to instrument the code
walker.walk(listener=instrument_listener, t=parse_tree)

# Step 9: Retrieve the instrumented source code
new_source_code = instrument_listener.token_stream_rewriter.getDefaultText()
print(new_source_code)

Figure 4. The driver code for instrumenting.

After the instrumentation is completed, the program must be compiled and then executed for the modifications to take effect. Figure 5 shows an example of execution with a sample input. As shown in Figure 5, the executed path for inputs 24 and 18 is logged to the console, in addition to the program output, which is 6 in this example. The sequence of the printed path shows the order in which the basic blocks were executed. We may change the instrumentation to capture more complicated runtime information; however, the techniques and principles remain the same as in this simple example. Interested readers may find more exercises about instrumentation at the end of this chapter.

Figure 5. An example of executing the GCD program after instrumenting.
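The logged path for inputs 24 and 18 can also be reproduced without a C++ toolchain by porting the instrumented program of Figure 2 to Python. This is only an illustrative sketch: instrumented_gcd is a hypothetical helper, and the list named path stands in for log_file.txt.

```python
def instrumented_gcd(num1, num2):
    """Python port of the instrumented GCD program; returns (gcd, path)."""
    path = []                    # plays the role of log_file.txt
    path.append('p1')            # probe at the beginning of the function
    gcd = None
    # Same bound as the C++ loop condition: i <= num1 && i <= num2
    for i in range(1, min(num1, num2) + 1):
        path.append('p2')        # probe at the beginning of the loop body
        if num1 % i == 0 and num2 % i == 0:
            path.append('p3')    # probe at the beginning of the if body
            gcd = i
    path.append('p4')            # probe after the loop
    return gcd, path


result, executed_path = instrumented_gcd(24, 18)
```

For inputs 24 and 18, the function returns 6, block p2 is logged 18 times (once per loop iteration), and block p3 is logged four times, for the common factors 1, 2, 3, and 6.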

Conclusion

We showed that, using the ANTLR listener mechanism, it is very simple to instrument real-world C++ programs. A similar technique can be used to instrument source code written in other programming languages, such as Java and C#. In the next tutorial, we discuss using ANTLR for static analysis of the source code and for computing some source code metrics.

Program static analysis with ANTLR

2021-03-29T23:45:00+04:30

Introduction

Source code and design metrics are extracted from the source code of the software, and their values allow us to reach conclusions about the quality attributes measured by the metrics.

A practical approach to computing such metrics is static analysis of the program source code. Again, this analysis can be performed by using the compiler front-end to work with the parse tree and symbol table. The idea is to create a symbol table for the program under analysis and extract the desired metrics from it. In this section, we demonstrate the use of ANTLR to compute two essential design metrics, FANIN and FANOUT, which affect the testability of a module.

FANIN and FANOUT can be computed from UML class diagrams. In the case of source code, we need to extract the class diagram from the program source code. We begin by constructing a simple symbol table to hold the necessary entities, e.g., classes and their relationships. As in our source code instrumentation tutorial, the ANTLR listener mechanism is used to build the symbol table. The structure of our symbol table is shown in Figure 1.

Figure 1: Class diagram of a simple symbol table for C++

The class diagram in Figure 1 has been implemented in Python. During syntax tree walking, each entity is recognized and saved in the corresponding instance of symbol table entities. For example, whenever a method is recognized, the instance of the Method class will be created to hold this method. The Model class is needed to keep a list of recognized classes as the top-level entities. The implementation code of the proposed symbol table in Python is straightforward, and we omit the code here.
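For readers who want something concrete to experiment with, the following is a minimal sketch of such a symbol table. It is not the original implementation: it assumes only the attributes that the listeners in this tutorial rely on (Model.class_list, Class.name, Class.fields, Field.type), while Field.name, Method.name, and Class.methods are additional assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    name: Optional[str] = None
    type: Optional['Class'] = None   # resolved user-defined type, if any


@dataclass
class Method:
    name: Optional[str] = None


@dataclass
class Class:
    name: Optional[str] = None
    fields: List[Field] = field(default_factory=list)
    methods: List[Method] = field(default_factory=list)


@dataclass
class Model:
    """Top-level container that keeps the list of recognized classes."""
    class_list: List[Class] = field(default_factory=list)


# Build the table for two classes where B aggregates A
model = Model()
class_a = Class(name='A')
class_b = Class(name='B')
class_b.fields.append(Field(name='a', type=class_a))
model.class_list.extend([class_a, class_b])
```

Using default_factory gives every Class and Model its own list, so entities recognized in one class are never shared with another by accident.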

The next step is to create a listener and add the code that fills the symbol table. Listing 1 shows the listener used to recognize source code classes and add them to the symbol table.

class DefinitionPassListener(CPP14Listener):
    """
    Pass 1: Extracting the classes and structs from a given CPP source code
    """

    def __init__(self, model: Model = None):
        if model is None:
            self.model = Model()
        else:
            self.model = model

        self.class_instance = None

    def enterClassspecifier(self, ctx: CPP14Parser.ClassspecifierContext):
        if self.class_instance is None:
            self.class_instance = Class()

    def exitClassspecifier(self, ctx: CPP14Parser.ClassspecifierContext):
        if self.class_instance is not None:     
            self.model.class_list.append(self.class_instance)
            self.class_instance = None

    def enterClassheadname(self, ctx: CPP14Parser.ClassheadnameContext):
        if self.class_instance is not None:
            self.class_instance.name = ctx.getText()

Listing 1: Recognizing classes in source code and inserting them into the symbol table

When an instance of DefinitionPassListener in Listing 1 is passed to the ParseTreeWalker instance, the classes within the source code are identified and inserted into the symbol table. This task is performed solely by implementing the listener methods that correspond to the class definition rule in the C++ grammar.

To better understand which methods of the base listener (CPP14Listener), generated by ANTLR, should be implemented to perform this task, we may look at the parse tree of a simple program consisting of one class with one field, as shown in Listing 2.

class A{
    string name;
};

Listing 2: C++ code snippet with one class and one field.

The parse tree of the code snippet in Listing 2 is shown in Figure 2. The parse tree can be visualized with the ANTLR plugin for IntelliJ IDEA. One can see the complexity of the C++ language and its compilation. The parse tree for this program, with only a few lines of code, has 39 nodes and requires more than 350 parse decisions (invocations in recursive descent parsing), which shows how complex real programming languages are. Therefore, compiler techniques are the most practical way to analyze and test such programs.

Figure 2: The parse tree for the code snippet shown in Listing 2

The classes recognized by applying DefinitionPassListener only have a name (set in Line 25). The DefinitionPassListener listener class does not capture any of the relationships required for computing FANIN and FANOUT or for any other analysis.

Relationships between classes in a program occur in different ways, e.g., through aggregation. In aggregation, one class has a field whose type is the other class. To extract the aggregation relationship, we should extract all fields whose types are user-defined. Therefore, we create another listener with the following code:

class ResolvePassListener(DefinitionPassListener):
    """
    Pass 2: Extracting the classes' fields
    """
    def __init__(self, model: Model = None):
        super().__init__(model=model)
        self.enter_member_specification = False
        self.field = Field()

    def enterMemberspecification(self, ctx: CPP14Parser.MemberspecificationContext):
        if ctx.getChildCount() == 3:
            self.enter_member_specification = True

    def enterDeclspecifier(self, ctx: CPP14Parser.DeclspecifierContext):
        if self.enter_member_specification:
            ctx_the_type_name = ctx.getText()
            for class_instance in self.model.class_list:
                if ctx_the_type_name == class_instance.name:
                    self.field.type = class_instance
                    self.class_instance.fields.append(self.field)
                    self.field = Field()  # start a fresh Field for the next member
                    break

Listing 3: Adding class fields to the program symbol table

The method enterDeclspecifier is invoked by ParseTreeWalker each time a declaration specifier is visited, which includes every field definition in the program source code. In ResolvePassListener, an extra check is therefore required to determine whether the recognized variable belongs to a class. The flag enter_member_specification is set to true in the enterMemberspecification method and is used to identify the scope of the variable. In the enterDeclspecifier method, the type name of the variable is checked to find out whether it is the name of another class. If the field has a user-defined type, the type of this field is resolved and the field is added to the fields of the current class.
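Once the aggregation relationships are in the symbol table, the two metrics reduce to simple counting. The sketch below assumes one common reading of the metrics, which this tutorial does not spell out: the FANOUT of a class is the number of distinct other user-defined classes it uses as field types, and its FANIN is the number of distinct other classes that use it. The dictionary-based input and the function names are hypothetical simplifications of the symbol table.

```python
def fan_out(aggregations):
    """Map each class to the number of distinct other classes it references."""
    known = set(aggregations)
    return {cls: len((set(types) & known) - {cls})
            for cls, types in aggregations.items()}


def fan_in(aggregations):
    """Map each class to the number of distinct other classes referencing it."""
    known = set(aggregations)
    counts = {cls: 0 for cls in aggregations}
    for cls, types in aggregations.items():
        for used in (set(types) & known) - {cls}:
            counts[used] += 1
    return counts


# class name -> type names of its fields (built-in types are ignored)
aggregations = {
    'A': ['B', 'C', 'string'],
    'B': ['C'],
    'C': [],
}
fan_out_values = fan_out(aggregations)
fan_in_values = fan_in(aggregations)
```

Here class A has a FANOUT of 2 and class C has a FANIN of 2, while the built-in type string is ignored because it is not a user-defined class.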

There are some practical considerations at this point. Why is a separate class defined for resolving the fields of classes? And why does ResolvePassListener inherit from DefinitionPassListener? The reason for separating the listener code into two classes is that the symbol table cannot be completed by traversing the parse tree only once. If we try to add the fields of a class at the same time that we add the class itself, we may not be able to find the proper type of the user-defined fields, since not all types have been inserted into the symbol table yet. The best practice is to apply two separate analysis passes: one for adding types to the symbol table, called the definition pass, and another for resolving the types to check or complete their information, called the resolve pass. Each pass in the compiling process reads the source code from start to end.
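The need for two passes can be illustrated without ANTLR. The following is a minimal sketch, not the tutorial's implementation: the `declarations` list and both functions are hypothetical stand-ins for a parse tree in which Student refers to Course before Course is declared.

```python
# Toy "program": each class lists (field_name, type_name) pairs.
# Student uses Course before Course appears in the source order.
declarations = [
    ("Student", [("course", "Course"), ("studentNumber", "long")]),
    ("Course", [("name", "string")]),
]


def single_pass(decls):
    """Insert classes and resolve field types in one traversal."""
    table, resolved = set(), []
    for class_name, fields in decls:
        table.add(class_name)
        for _field_name, type_name in fields:
            if type_name in table:  # only classes seen so far are known
                resolved.append((class_name, type_name))
    return resolved


def two_passes(decls):
    """Definition pass first, then a separate resolve pass."""
    table = {class_name for class_name, _ in decls}  # definition pass
    resolved = []
    for class_name, fields in decls:  # resolve pass
        for _field_name, type_name in fields:
            if type_name in table:
                resolved.append((class_name, type_name))
    return resolved


print(single_pass(declarations))
print(two_passes(declarations))
```

In this sketch, the single traversal misses the Student-to-Course relationship because Course is not yet in the table when the field is visited, while the two-pass version finds it.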

The resolve pass inherits from the definition pass since the operations of the definition pass are still required. For example, Line 20 in ResolvePassListener requires the current class when adding the recognized field to it. However, DefinitionPassListener, as given in Listing 1, is not suitable as a parent for ResolvePassListener. It only inserts new classes into the symbol table, whereas we need to retrieve them when ResolvePassListener is applied. Another problem is that if the current code of DefinitionPassListener is executed more than once, the same class is inserted into self.model.class_list, the class list of the symbol table, more than once. We should fix the DefinitionPassListener class to solve these two problems.

First, before adding a new class (Line 25 in Listing 1), it should be checked that the class does not already exist in the symbol table. Second, if the class already exists in the symbol table, the enterClassheadname method should retrieve the corresponding class by its name and assign it to the self.class_instance object. These conditions are expected to be met when ResolvePassListener is executed as the second pass of our analysis. Listing 4 shows the modified version of DefinitionPassListener.

class DefinitionPassListener(CPP14Listener):
    """
    Pass 1 (modified): Extracting the classes and structs
    """

    def __init__(self, model: Model = None):
        if model is None:
            self.model = Model()
        else:
            self.model = model

        self.class_instance = None

    def enterClassspecifier(self, ctx: CPP14Parser.ClassspecifierContext):
        if self.class_instance is None:
            self.class_instance = Class()

    def exitClassspecifier(self, ctx: CPP14Parser.ClassspecifierContext):
        if self.class_instance is not None:
            # Insert the class only if no class with this name is stored yet.
            if not self.model.find_class_by_name(self.class_instance.name):
                self.model.class_list.append(self.class_instance)
            self.class_instance = None

    def enterClassheadname(self, ctx: CPP14Parser.ClassheadnameContext):
        if self.class_instance is not None:
            if self.model.find_class_by_name(ctx.getText()):
                # Second pass: reuse the class stored by the first pass.
                self.class_instance = self.model.get_class_by_name(ctx.getText())
            else:
                self.class_instance.name = ctx.getText()

Listing 4: The fixed version of the DefinitionPassListener class.
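Listing 4 relies on lookups in the Model class that are not shown in this tutorial. The sketch below is one possible shape for such a symbol table, under the assumption that classes are identified by name only. The `Class` and `Model` classes and the `add_class_once` helper are hypothetical illustrations, not the code from the CodA repository.

```python
class Class:
    """Hypothetical stand-in for the tutorial's Class symbol."""

    def __init__(self, name=None):
        self.name = name
        self.fields = []


class Model:
    """Hypothetical symbol table holding one Class object per class name."""

    def __init__(self):
        self.class_list = []

    def get_class_by_name(self, name):
        # Return the stored Class with this name, or None if it is absent.
        return next((c for c in self.class_list if c.name == name), None)

    def add_class_once(self, name):
        # Reuse the existing entry so repeated passes never duplicate a class.
        existing = self.get_class_by_name(name)
        if existing is None:
            existing = Class(name)
            self.class_list.append(existing)
        return existing


model = Model()
for _ in range(2):  # simulate walking the same parse tree twice
    for name in ["Person", "Student", "Course"]:
        model.add_class_once(name)

print([c.name for c in model.class_list])
```

Because the second walk retrieves the stored objects instead of creating new ones, fields added during a resolve pass end up on the same Class instances that the definition pass inserted.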

In this tutorial, we assumed that the input program is compilable, and hence we did not perform additional compile-time tasks such as type checking. The complete implementation of the two listeners, including import statements and some additional code, will be available on the CodA repository.

Once our listeners are completed, we can add driver code that attaches these listeners to a ParseTreeWalker and performs the target task, as discussed in the ANTLR basic tutorial. The only difference is that we have two listeners, which must be executed one after the other, in order, to get the desired result. Listing 5 shows the driver code for our static analysis task.

stream = FileStream(input_string)  # input_string holds the path of the C++ source file
lexer = CPP14Lexer(stream)
token_stream = CommonTokenStream(lexer)
parser = CPP14Parser(token_stream)
parse_tree = parser.translationunit()  # start rule of the CPP14 grammar

pass1 = DefinitionPassListener()

walker = ParseTreeWalker()
walker.walk(listener=pass1, t=parse_tree)

pass2 = ResolvePassListener(model=pass1.model)
walker.walk(listener=pass2, t=parse_tree)

Listing 5: Driver code to perform static analysis of the source code.

The last step in our analysis is to build the class diagram as a directed annotated graph from the symbol table and compute the FAN-IN and FAN-OUT metrics for each class. This step is done by creating a node for each class and adding an edge between every two classes that have an aggregation relationship. The direction of each edge specifies the direction of the aggregation.

Listing 6 shows the methods that create and visualize the discussed graph. Two methods are defined in the Model class, which was part of our symbol table in the previous steps. The first method, create_graph, creates a graph for the class diagram. It uses the NetworkX library to work with graphs. The second method, draw_graph, visualizes the created graph. The Model class also has two fields, class_list and class_diagram, which are not shown in Listing 6. The first field holds all class instances of the source code, and the second field holds the graph corresponding to the class diagram.

__date__ = '2021-07-19'
__author__ = 'Morteza Zakeri'

import matplotlib.pyplot as plt
import networkx as nx


class Model:
    # The class_list and class_diagram fields are omitted here.

    def create_graph(self):
        class_diagram = nx.DiGraph()
        for class_instance in self.class_list:
            class_diagram.add_node(class_instance.name)

        for class_instance in self.class_list:
            if class_instance.attributes_list is not None:
                for class_attribute in class_instance.attributes_list:
                    # Only fields with a user-defined type produce an edge.
                    if isinstance(class_attribute.variable_type, (Class, Structure)):
                        w = 1
                        if class_diagram.has_edge(class_instance.name, class_attribute.variable_type.name):
                            w = class_diagram[class_instance.name][class_attribute.variable_type.name]['weight']
                            w += 1
                        class_diagram.add_edge(class_instance.name, class_attribute.variable_type.name,
                                               rel='Aggregation',
                                               weight=w)
        self.class_diagram = class_diagram

    def draw_graph(self):
        new_names_dict = dict()
        for node_name in self.class_diagram.nodes:
            new_names_dict.update({node_name: node_name})
        edge_labels = nx.get_edge_attributes(self.class_diagram, 'rel')
        # The 'weight' attribute set in create_graph counts the aggregated fields.
        edge_labels2 = nx.get_edge_attributes(self.class_diagram, 'weight')

        pos = nx.kamada_kawai_layout(self.class_diagram)
        nx.draw_networkx_nodes(self.class_diagram, pos,
                               nodelist=list(self.class_diagram.nodes),
                               node_shape='s',
                               node_size=1000,
                               alpha=0.25,
                               node_color='r')

        nx.draw_networkx_edges(self.class_diagram, pos,
                               edgelist=list(self.class_diagram.edges),
                               width=2.0,
                               alpha=0.95,
                               edge_color='b')
        nx.draw_networkx_edge_labels(self.class_diagram, pos, edge_labels=edge_labels)
        # label_pos shifts the second label so that it does not cover the first one.
        nx.draw_networkx_edge_labels(self.class_diagram, pos, edge_labels=edge_labels2, label_pos=0.3)
        nx.draw_networkx_labels(self.class_diagram, pos, new_names_dict, font_size=11)

        plt.show()

Listing 6: Methods for creating and visualizing a simple class diagram.

FAN-IN and FAN-OUT for each class are defined, respectively, as the in-degree and out-degree of the corresponding node in the class diagram graph. Therefore, once we have that graph, we can compute these metrics quickly. To illustrate the discussed static analysis on a real program, consider the C++ program in Listing 7, which has four simple classes: Person, Student, Teacher, and Course. The implementation of the classes has been kept minimal for simplicity. Both the Student class and the Teacher class inherit from the Person class. In addition, the Student class aggregates an instance of the Course class.

# include <string>
# include <iostream>
using namespace std;

class Person
{
protected:
     string firstName;
     string lastName;
     int nationalCode;
public:
    Person(string firstName, string lastName, int nationalCode);
    void setPersonName(string firstName, string lastName);
    virtual int doJob();
};

Person::Person(string firstName, string lastName, int nationalCode)
{
    this->firstName = firstName;
    this->lastName = lastName;
    this->nationalCode = nationalCode;
}

void Person::setPersonName(string firstName, string lastName)
{
    this->firstName = firstName;
    this->lastName = lastName;
}

int Person::doJob()
{
    cout << this->firstName << " is a person " << endl;
    return 0;
}

class Course;  // forward declaration, the class is defined below

class Student: public Person
{
private:
    long studentNumber;
    Course* course;
public:
    Student(string firstName, string lastName, int nationalCode, long studentNumber);
    int doJob() override;
};

class Course
{
public:
    string name;
    int number;
    Course(string course_name, int course_number = 0);
};

Course::Course(string course_name, int course_number)
{
    this->name = course_name;
    this->number = course_number;
}

Student::Student(string firstName, string lastName, int nationalCode, long studentNumber):Person(firstName, lastName, nationalCode)
{
    this->studentNumber = studentNumber;
    cout << "I am a student: " << this->studentNumber << endl;
    this->course = new Course("Software Engineering");
}

int Student::doJob()
{
    cout << this->firstName << " is studying " << endl;
    return 20;
}

class Teacher: public Person
{
private:
    long teacherNumber;
public:
    Teacher(string firstName, string lastName, int nationalCode, long teacherNumber);
    int doJob() override;
};

Teacher::Teacher(string firstName, string lastName, int nationalCode, long teacherNumber):Person(firstName, lastName, nationalCode)
{
    this->teacherNumber = teacherNumber;
    cout << "I am a teacher: " << this->teacherNumber << endl;
}

int Teacher::doJob()
{
    cout << this->firstName << " is teaching " << endl;
    return 0;
}

/* main function */
int main()
{
    Teacher t1("Saeed", "Parsa", 1234, 1398);
    Student s1("Morteza", "Zakeri", 5678, 2020);
    t1.doJob();
    s1.doJob();
}

Listing 7: A C++ application to test the developed static analysis program in this tutorial.

The graph corresponding to the class diagram of this program, which is the output of executing the code in Listings 5 and 6, is shown in Figure 3. As one can see, the inheritance relationships are also shown in the figure. We omitted the code that captures the inheritance relationship in this section. You may try to implement the extraction of inheritance relationships as an exercise after reading this tutorial.

Figure 3: Class diagram for the program shown in Listing 7.

FAN-IN and FAN-OUT metrics can be computed, as discussed earlier. For this simple example, FAN-IN for the Student class is 0, and FAN-OUT is 1; however, a complete computation of these metrics should consider all relationships, including association, dependency, and parameter passing.
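The degree computation can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions: the `classes` and `aggregations` lists are written by hand from Listing 7 and contain only the single aggregation edge, and `fan_metrics` is a hypothetical helper. With the NetworkX graph from Listing 6, the same numbers should be available through the graph's `in_degree` and `out_degree`.

```python
from collections import Counter

# Classes of Listing 7 and the only aggregation edge: Student has a Course field.
classes = ["Person", "Student", "Teacher", "Course"]
aggregations = [("Student", "Course")]  # (owner class, aggregated class)


def fan_metrics(nodes, edges):
    """Return {class_name: (fan_in, fan_out)} for a list of distinct directed edges."""
    fan_out = Counter(src for src, _ in edges)
    fan_in = Counter(dst for _, dst in edges)
    # Counter yields 0 for classes that never appear on that side of an edge.
    return {node: (fan_in[node], fan_out[node]) for node in nodes}


metrics = fan_metrics(classes, aggregations)
for name, (fan_in, fan_out) in metrics.items():
    print(f"{name}: FAN-IN={fan_in}, FAN-OUT={fan_out}")
```

Adding inheritance or association edges to the list would change the results, which is why the complete metrics require all relationship kinds to be extracted first.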

Summary

In this tutorial and the previous one, I discussed the application of compilers in static and dynamic software analysis. I demonstrated these applications through a simple example of source code instrumentation and metrics computation. The former is a transformation task that modifies the source code, and the latter is an analysis task that extracts some information from the source code. Both of them are essential tasks in the future of software engineering.

Systematic software testing and quality assurance tools can be built on top of compiler tools such as ANTLR, LLVM, JDT, and Roslyn, with the techniques presented in this chapter. Compilers build a detailed model of application code as they validate the syntax and semantics of that code. While traditional compilers used such a model to build the executable output from the source code in a black-box manner, the new generation of compilers provides APIs to access the internal details of this model, which can be utilized to build more reliable software. Software testing is more realistic with advanced support by compilers.

Advanced Software Engineering

2021-03-23T00:23:00+04:30

Foreword

The AUT advanced software engineering (ASE) course aims at teaching the latest and emerging topics and advances in the field of software engineering to the students who are already familiar with basic subjects in the field. Here, I will share relevant materials and resources with you.

Course Description

This course delves into the latest trends, emerging topics, and cutting-edge advances in the field of software engineering. Designed for students with a foundational understanding of software engineering concepts, the course covers innovative methodologies, tools, and frameworks that are shaping the modern software development landscape. Through in-depth case studies, hands-on projects, and discussions on real-world challenges, students will explore how to drive innovation and implement advanced practices in software engineering.

Course Objectives

By the end of this course, students will:

Gain familiarity with the latest research and advancements in software engineering.
Learn to adopt and adapt emerging methodologies and frameworks for large-scale software systems.
Understand the impact of intelligent tools and automation in the software development lifecycle (SDLC).
Explore advanced topics like microservices, cloud-native development, and software observability.
Critically assess and integrate evolving practices to solve modern software engineering challenges.

Syllabus

Week 1-2: Introduction to Advanced Software Engineering

Overview of modern challenges in software engineering
Emerging trends and technologies in the field
Recap of fundamental principles and frameworks

Week 3-4: Continuous Integration, Deployment, and Delivery

CI/CD pipelines and automation tools
Best practices for seamless software deployment
Ensuring quality and reliability in fast-paced development

Week 5-6: DevOps and Cloud-Native Software Engineering

Principles of DevOps culture and practices
Cloud-native development and containerization with Docker and Kubernetes
Serverless architectures and their applications

Week 7-8: Software Observability and Resilience

Monitoring, logging, and tracing software systems
Building resilient applications with fault tolerance
Best practices for incident response and root cause analysis

Week 9-10: Intelligent Tools and Automation in Software Engineering

Machine learning applications in the SDLC
Automated code generation, testing, and debugging
Intelligent software refactoring and evolution

Week 11-12: Advanced Software Architectures

Exploring microservices and event-driven architectures
Designing scalable, maintainable, and secure systems
Understanding domain-driven design (DDD)

Week 13: Ethical and Societal Aspects of Software Engineering

Understanding the social and ethical implications of modern software
Discussing data privacy, security, and sustainability in engineering practices

Week 14: Capstone Project and Review

Applying advanced concepts to design and implement a solution to a real-world problem
Final presentations and peer reviews

Course Assessment

Assignments (25%): Hands-on tasks focusing on advanced tools and techniques.
Paper-based Exam (40%): Evaluation of theoretical understanding of emerging topics.
Capstone Project (25%): Team-based project addressing real-world challenges in software engineering.
Participation (10%): Contribution to discussions, peer reviews, and active engagement in the course.

Resources

Textbooks:
Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation by Jez Humble and David Farley
Building Microservices: Designing Fine-Grained Systems by Sam Newman
Online Platforms:
Cloud platforms such as AWS, Azure, or Google Cloud for practical exercises
Tools like Jenkins, Docker, Kubernetes, and Prometheus
Research Papers:
Recent publications on software engineering from top conferences like ICSE and ASE

Prerequisites

A solid foundation in software engineering principles.
Familiarity with basic software development methodologies and tools.

Contact Information

For inquiries, feel free to reach out via my webpage: www.m-zakeri.github.io.


Course history

Teaching assistant

I was a teaching assistant for the Advanced Software Engineering M.Sc. and Ph.D. course taught by Dr. Saeed Parsa for six semesters at Iran University of Science and Technology. Our teaching materials from these three years are available to view and download.

Useful links

Course official website

Compilers

2021-03-23T00:23:00+04:30

Provide an in-depth understanding of the principles, techniques, and tools used in the design and implementation of compilers.
Equip students with the ability to write and optimize programs by understanding how high-level code is translated into machine-level instructions.
Introduce concepts of language parsing, code generation, and optimization techniques that are critical for the development of efficient and reliable software.
Foster the ability to analyze and improve existing compiler systems, as well as design new ones from scratch.

Objectives

By the end of this course, students will be able to:

1. Understand the fundamental phases of a compiler, including lexical analysis, syntax analysis, semantic analysis, code generation, and optimization.
2. Develop skills to implement key components of a compiler, such as parsers, symbol tables, and intermediate code generators.
3. Recognize and utilize various parsing techniques, including top-down and bottom-up parsing strategies.
4. Apply optimization techniques to improve the performance of compiled code.
5. Gain exposure to modern tools and technologies used in compiler construction, such as Lex and Yacc.

Syllabus

1. Introduction to Compiler Design

Role of a compiler in the software development lifecycle
Overview of programming languages and their features
Architecture of a compiler

2. Lexical Analysis

Lexical tokens, patterns, and lexemes
Design of lexical analyzers
Introduction to Lex and its usage

3. Syntax Analysis

Context-free grammars and parse trees
Top-down parsing (recursive descent, LL) and bottom-up parsing (LR, LALR)
Error detection and recovery mechanisms in parsers

4. Semantic Analysis

Role of semantic analysis in a compiler
Type checking and type inference
Symbol tables and attribute grammars

5. Intermediate Code Generation

Representations such as three-address code, quadruples, and abstract syntax trees
Translating high-level constructs into intermediate representations

6. Code Optimization

Local vs global optimization techniques
Loop optimizations, dead code elimination, and inlining
Role of data-flow analysis in optimization

7. Code Generation

Instruction selection and register allocation
Basic blocks and flow graphs
Generating efficient machine-level code

8. Advanced Topics

Just-In-Time (JIT) Compilation
Compiler frameworks (LLVM, GCC)
Introduction to domain-specific languages (DSLs)

9. Project and Case Studies

Design and implementation of a small-scale compiler
Case studies on modern compiler systems and frameworks

Teaching assistant

Foreword

I was a teaching assistant for the Compiler Design and Construction B.Sc. course taught by Dr. Saeed Parsa for seven semesters (more than three years) at Iran University of Science and Technology. Our teaching materials from these years are available to view and download.

I put all the source code that I developed for teaching compilers to students in a practical way on the GitHub page of the IUST compiler course.

Useful links

Projects

I designed and planned some practical projects about the applications of compiler science in program analysis. The projects shown in Table 1 were assigned to the students who took the IUST compiler course during different semesters. Click on the link in the "Project" column to see the project proposal.

Table 1: Compiler projects.

Project	Description	Semesters	Courses
OpenUnderstand 2	Low-level source code metrics calculation	Spring 2022	Compiler
OpenUnderstand	Symbol table development	Fall 2021, Spring 2022	Compiler
QualityMeter	Source code quality attribute computation; refactoring opportunity detection	Fall 2021	Advanced compiler
CodART 2	Source code smell detection	Spring 2021 (Cancelled)	Compiler
CodART	Source code refactoring	Fall 2020, Spring 2021	Compiler
CodART	Refactoring to design pattern at the source code level	Fall 2020	Advanced compiler
CleanCode	Source code smell detection	Fall 2019, Spring 2020	Compiler
CodA	Source code instrumentation and testbed analysis tool	Fall 2018	Compiler / Advanced compiler
ANTLR MiniJava	Parse-tree and intermediate code generation for the MiniJava programming language with ANTLR	Fall 2016, Spring 2017	Compiler

As a student

I always enjoy learning about compilers, code transformation, and their application in automated software engineering. I firmly believe that the next generation of software engineers are intelligent white-box compilers! Such compilers are structure-aware, context-aware, and domain-aware, assisting the programmer in writing high-quality and testable programs. Compilers helped artificial intelligence (AI) in the past, and now AI boosts compilers!

Patterns and Principle in Software Engineering

2021-03-23T00:23:00+04:30

Course Description

This course explores the foundational principles and design patterns that underpin modern software engineering. It aims to equip students with a deep understanding of reusable solutions, architectural patterns, and best practices to enhance the quality, maintainability, and scalability of software systems. Through a mix of theoretical concepts and practical applications, students will learn how to apply these principles effectively in real-world scenarios.

Course Objectives

By the end of this course, students will:

Understand core software design principles such as SOLID, DRY, and KISS.
Analyze and apply common software design patterns including Creational, Structural, and Behavioral patterns.
Develop skills to architect scalable and maintainable software systems.
Explore principles of software evolution, refactoring, and maintainability.
Gain insights into modern software development practices, including agile development and DevOps principles.

Syllabus

Week 1-2: Introduction to Software Engineering Principles

Overview of Software Engineering
Key concepts: Scalability, Maintainability, and Modularity
Introduction to SOLID Principles and Object-Oriented Design

Week 3-4: Design Patterns - An Overview

What are Design Patterns?
Types of Patterns: Creational, Structural, Behavioral
Case studies of practical design pattern applications

Week 5-6: Creational Patterns

Singleton, Factory, Abstract Factory, Builder, Prototype
Practical examples and coding exercises

Week 7-8: Structural Patterns

Adapter, Bridge, Composite, Decorator, Facade, Flyweight, Proxy
Designing software with structural patterns

Week 9-10: Behavioral Patterns

Chain of Responsibility, Command, Interpreter, Iterator
Mediator, Memento, Observer, State, Strategy, Template, Visitor

Week 11: Architectural Patterns

MVC (Model-View-Controller), MVP, MVVM
Microservices, Event-Driven Architecture

Week 12: Software Evolution and Refactoring

Principles of software refactoring
Techniques to improve existing codebases while maintaining functionality
Tools and practices for code quality improvement

Week 13: Patterns in Agile Development and DevOps

Continuous Integration and Continuous Deployment (CI/CD) Practices
Design patterns supporting agile development
Design for testability and automation

Week 14: Capstone Project and Review

Applying learned patterns and principles to a real-world project
Final presentations and feedback

Course Assessment

Assignments (25%): Weekly exercises on applying patterns.
Paper-based Exam (40%): Theoretical and practical understanding of design principles.
Capstone Project (25%): Collaborative project designing a software system.
Participation (10%): Engaging in discussions and code reviews.

Resources

Textbooks:
Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma et al.
Clean Architecture: A Craftsman’s Guide to Software Structure and Design by Robert C. Martin
Online Platforms:
GitHub for version control and collaboration
IDEs like IntelliJ IDEA or Visual Studio Code for coding exercises
Additional Resources:
Documentation for tools and frameworks used in the course
Research papers on software architecture and design

Prerequisites

Basic understanding of programming concepts.
Familiarity with object-oriented programming (OOP) and data structures.

Contact Information

For questions, reach out via my webpage: www.m-zakeri.github.io.

An introduction to ANTLR in Python

2021-03-22T23:00:00+04:30

Background

The ANTLR tool generates a top-down parser from the grammar rules defined with the ANTLR meta-grammar (Parr and Fisher 2011). The initial version of ANTLR generated the target parser source code in Java. In the current version (version 4), the parser source code can be generated in a wide range of programming languages listed on the ANTLR official website (Parr 2022a). For simplicity, we generate the parser in Python 3, which allows us to run the tool on every platform that has Python 3 installed. Another reason to use Python is that we can easily integrate the developed program with other libraries available in Python, such as machine learning and optimization libraries. Finally, I found no comprehensive tutorial on using ANTLR with the Python backend.

To use ANTLR in other programming languages, specifically Java and C#, refer to the ANTLR slides I created before this tutorial.

The ANTLR tool is a small “.jar” file that must be run from the command line to generate the parser code. The ANTLR tool jar file can be downloaded from here.

Generating parser

As mentioned, to generate a parser for a programming language, the grammar specification described with ANTLR meta-grammar is required. ANTLR grammar files are named with the “.g4” suffix.

We obtain the grammar of Java 8 to build our parser for the Java programming language. The grammar can be downloaded from the ANTLR 4 grammar repository on GitHub: https://github.com/antlr/grammars-v4. Once the ANTLR tool and the required grammar files are prepared, we can generate the lexer and the parser with the following commands:

> java -Xmx500M -cp antlr-4.9.3-complete.jar org.antlr.v4.Tool -Dlanguage=Python3 -o . JavaLexer.g4

> java -Xmx500M -cp antlr-4.9.3-complete.jar org.antlr.v4.Tool -Dlanguage=Python3 -visitor -listener -o . JavaLabeledParser.g4

The first command generates the lexer from the JavaLexer.g4 description file, and the second command generates the parser from the JavaLabeledParser.g4 description file. It is worth noting that the lexer and parser can be written in one file. In such a case, a single command generates all the required code in one step.

The grammar files used in the above commands are also available in the grammars directory of the CodART repository. You may see that I have made some modifications to the parser rules.

In the above commands, antlr-4.9.3-complete.jar is the ANTLR tool, which requires Java to be executed. The -Dlanguage option denotes the target language in which the ANTLR parser (and lexer) source code is generated. In our case, we set it to Python3.

After executing the ANTLR parser generation commands, eight files, including the parser source code and other required information, are generated. Figure 1 shows the generated files. The “.py” files contain the lexer and parser source code that can parse any Java input file. The -visitor and -listener switches in the second command result in generating two separate source files, JavaLabeledParserListener.py and JavaLabeledParserVisitor.py, which provide interfaces for implementing the code required by a specific language application. Our application is source code refactoring, which uses the listener mechanism to implement the actions that transform the program into its refactored version. The parse tree structure and the listener mechanism are discussed in the next sections.

Figure 1. Generated files by ANTLR.

To use the generated classes in Figure 1 when developing a specific program, we need to install the appropriate ANTLR runtime library. For creating ANTLR-based programs in Python, the command pip install antlr4-python3-runtime can be used. It installs the runtime required to program with the ANTLR library. The runtime version should match the version of the ANTLR tool used to generate the parser (4.9.3 in the commands above).

ANTLR parse tree

The parser generated by ANTLR is responsible for parsing every Java source code file and either generating the parse tree or reporting the syntax errors in the input file. The parse tree for real-world programs with thousands of lines of code has a non-trivial structure. ANTLR developers have provided some IDE plugins that can visualize the parse tree to better understand its structure. We use the PyCharm IDE developed by JetBrains to work with Python code.

Figure 2 shows how we can install the ANTLR plugin in PyCharm. The plugin source code is available on the GitHub repo. When the plugin is installed, the ANTLR preview window appears at the bottom of the PyCharm IDE. In addition, the IDE can recognize “.g4” files, and some other options are added to the IDE. The main option is the ability to test a grammar rule and visualize the parse tree corresponding to that rule.

Figure 2. Installing the ANTLR plugin in the PyCharm IDE.

In order to use the ANTLR preview tab, the ANTLR grammar should be opened in the PyCharm IDE. We then select a rule (typically the start rule) of our grammar, right-click on the rule, and select the “Test Rule rule_name” option from the opened menu, shown in Figure 3. We now write our sample input program in the left panel of the ANTLR preview, and the parse tree is shown in the right panel.

Figure 3. Test the grammar rule in the ANTLR PyCharm plugin.

Figure 4 shows a simple Java class and the corresponding parse tree generated by ANTLR. The leaves of the parse tree are program tokens, while the intermediate nodes are the grammar rules from which the program under evaluation is derived. Also, the root of the tree is the grammar rule that we selected to start parsing. It means that we can select and test every rule independently. However, a complete Java program can only be parsed from the start rule of the given grammar, i.e., the compilationUnit rule.

Figure 4. A simple Java class and the corresponding parse tree generated by ANTLR.

It should be mentioned that the ANTLR Preview window is based on a grammar interpreter, not on the actual generated parser described in the previous section. It means that grammar attributes such as actions and predicates will not be evaluated during live preview because the interpreter is language agnostic. For the same reasons, if the generated parser and/or lexer classes extend a custom implementation of the base parser/lexer classes, the custom code will not be run during the live preview.

In addition to the parse tree visualization, the ANTLR plugin provides facilities such as profiling and code generation, described in (Parr 2022b). For example, the profile tab shows the execution time of each rule in the parser for a given input string.

I want to emphasize that visualizing the parse tree with the ANTLR plugin is really helpful when developing and debugging the code described in the next section of this tutorial.

Traversing the parse tree programmatically

ANTLR is more than a simple parser generator. It provides a depth-first parse tree walker and a callback mechanism, called a listener, for implementing the required program analysis or transformation passes. The depth-first walk is performed by instantiating an object of the ANTLR ParseTreeWalker class and calling its walk method, which takes an instance of ParseTree as an input argument and traverses it.

If we visit the parse tree with the depth-first search algorithm, all program tokens are visited in the same order in which they appear in the source code file. In addition, the depth-first search tells us when a node in the tree is first visited and when the visit of all nodes in its subtree has finished. Therefore, we can attach the required actions to the visit of a node to perform a specific task. For example, according to Figure 4, to count the number of classes in a code snippet, we can define a counter variable, initialize it to zero, and increase it whenever the walker visits a “classDeclaration” node.

ANTLR provides two callback functions for each node in the parse tree. One is called by the walker when it enters a node, i.e., when the node is visited but its children have not been visited yet. The other is called when all nodes in the subtree of the visited node have been visited and the walker is exiting the node. These callback functions are available in the listener class that ANTLR generates, one pair for every rule in the given grammar. In our example of counting the number of classes, we implement all the required logic in the body of the enterClassDeclaration method of the JavaLabeledParserListener class. We call this logic the grammar's actions since it is bound to a grammar rule.
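The order of the enter and exit callbacks can be illustrated without ANTLR at all. The following is a minimal sketch, assuming a toy tree: the `Node`, `ClassCounter`, and `walk` names are hypothetical stand-ins for the ANTLR parse tree, the generated listener, and `ParseTreeWalker.walk`, not the ANTLR API itself.

```python
class Node:
    """A toy parse-tree node: a rule name plus child nodes (hypothetical)."""
    def __init__(self, rule, children=()):
        self.rule = rule
        self.children = list(children)


class ClassCounter:
    """Toy listener: counts classDeclaration nodes and records callback order."""
    def __init__(self):
        self.count = 0
        self.events = []

    def enter(self, node):
        # Called before any child of the node is visited
        self.events.append(f"enter {node.rule}")
        if node.rule == "classDeclaration":
            self.count += 1

    def exit(self, node):
        # Called after the whole subtree of the node has been visited
        self.events.append(f"exit {node.rule}")


def walk(listener, node):
    """Depth-first walk: enter the node, walk its children, then exit it."""
    listener.enter(node)
    for child in node.children:
        walk(listener, child)
    listener.exit(node)


tree = Node("compilationUnit", [
    Node("classDeclaration", [Node("classBody")]),
    Node("classDeclaration", [Node("classBody")]),
])
counter = ClassCounter()
walk(counter, tree)
print(counter.count)
print(counter.events)
```

Each node produces exactly one enter event and one exit event, and the exit event of a node always comes after the events of its whole subtree.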

It is worth noting that we can also add this action code to the grammar file (.g4 file) to form an attributed grammar. Embedding actions in the grammar increases the efficiency of the analysis process. However, when we need many complex actions, the listener mechanism provides a better way to implement them. Indeed, ANTLR 4 emphasizes separating language applications from the language grammar by using the listener mechanism.

Listing 1 shows the implementation of the program for counting the number of classes using the ANTLR listener mechanism. The DesignMetrics class inherits from the JavaLabeledParserListener class, which is the default listener class generated by ANTLR. We only implement the enterClassDeclaration method, which increases the value of the __dsc counter each time the walker visits a Java class.

# module: JavaLabeledParserListener.py

__version__ = "0.1.0"
__author__ = "Morteza"

from antlr4 import *
if __name__ is not None and "." in __name__:
    from .JavaLabeledParser import JavaLabeledParser
else:
    from JavaLabeledParser import JavaLabeledParser

class JavaLabeledParserListener(ParseTreeListener):
    # …
    def enterClassDeclaration(self,
                              ctx:JavaLabeledParser.ClassDeclarationContext):
        pass
    # …

class DesignMetrics(JavaLabeledParserListener):
    def __init__(self):
        self.__dsc:int = 0  # Keep design size in classes

    @property
    def get_design_size(self):
        return self.__dsc

    def enterClassDeclaration(self,
                              ctx:JavaLabeledParser.ClassDeclarationContext):
        self.__dsc += 1

Listing 1: Program that counts the number of classes in Java source code.

Wiring the modules

To complete our simple analysis task, the parse tree for a given input should first be constructed. Then, the DesignMetrics class should be instantiated and passed to an object of the ParseTreeWalker class. We created a driver module in Python, next to the code generated by ANTLR, to connect the different parts of our program and complete our task. Listing 2 shows the implementation of the main driver for a program that counts the number of classes in Java source code.

# Module: main_driver.py

__version__ = "0.1.0"
__author__ = "Morteza"

import argparse

from antlr4 import *
from JavaLabeledLexer import JavaLabeledLexer
from JavaLabeledParser import JavaLabeledParser
from JavaLabeledParserListener import DesignMetrics

def main(args):

    # Step 1: Load input source into the stream object
    stream = FileStream(args.file, encoding='utf8')

    # Step 2: Create an instance of JavaLabeledLexer
    lexer = JavaLabeledLexer(stream)

    # Step 3: Convert the input source into a list of tokens
    token_stream = CommonTokenStream(lexer)

    # Step 4: Create an instance of JavaLabeledParser
    parser = JavaLabeledParser(token_stream)

    # Step 5: Create parse tree, starting from the grammar's start rule
    parse_tree = parser.compilationUnit()

    # Step 6: Create an instance of DesignMetrics listener class
    my_listener = DesignMetrics()

    # Step 7: Create a walker to traverse the parse tree and callback our listener
    walker = ParseTreeWalker()
    walker.walk(t=parse_tree, listener=my_listener)

    # Step 8: Getting the results (get_design_size is a property, so it is not called)
    print(f'DSC={my_listener.get_design_size}')

if __name__ == '__main__':
    # Read the path of the input Java file from the command line
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument('-f', '--file', required=True,
                            help='Path of the input Java source file')
    main(arg_parser.parse_args())

Listing 2: Main driver module for the program in Listing 1

Conclusion and remarks

In this tutorial, we described the basic concepts of using the ANTLR tool to generate and walk parse trees and to implement custom program analysis applications with the help of the ANTLR listener mechanism. The most important point is that we used the grammar of a real-world programming language to show the parsing and analysis process. The discussed topics form the underlying concepts of our approach to automated refactoring used in CodART. Check out the ANTLR advanced tutorial to find out how we can use ANTLR for reliable and efficient program transformation.

References

Parr T (2022a) ANTLR (ANother Tool for Language Recognition). https://www.antlr.org. Accessed 10 Jan 2022

Parr T (2022b) IntelliJ Idea Plugin for ANTLR v4. https://github.com/antlr/intellij-plugin-v4. Accessed 10 Jan 2022

Parr T, Fisher K (2011) LL(*): the foundation of the ANTLR parser generator. Proc 32nd ACM SIGPLAN Conf Program Lang Des Implement 425–436. https://doi.org/10.1145/1993498.1993548

Innovations on Automatic Test Data Generation

2021-03-22T23:00:00+04:30

Fuzz testing (fuzzing) is a dynamic software testing technique in which malformed test data are repeatedly generated and injected into the software under test (SUT) to find possible faults and vulnerabilities. To this end, fuzz testing requires a wide variety of test data. The most critical challenge is handling the complexity of the file structures that programs take as input. Surveys have revealed that much of the test data generated in these cases exercises only a limited number of shallow paths, because the parser of the SUT rejects it in the initial stages of parsing. Using the grammatical structure of input files to generate test data increases code coverage. However, the grammar is often extracted manually, which is time-consuming, costly, and error-prone.

Recently, we proposed an automated method for hybrid test data generation. We applied neural language models (NLMs) built from recurrent neural networks (RNNs). Using deep learning techniques, the proposed models learn the statistical structure of complex files and then generate new textual test data based on the grammar, and binary data based on mutations. The generated data are fuzzed by two newly introduced algorithms, called neural fuzz algorithms, that use these models. We used our method to generate test data and then fuzz test MuPDF, a complex piece of software that takes portable document format (PDF) files as input. To train our generative models, we gathered a large corpus of PDF files. Our experiments demonstrate that the data generated by this method increase code coverage by more than 7.0% compared to state-of-the-art file format fuzzers such as American fuzzy lop (AFL). The experiments also indicate that simpler NLMs achieve better learning accuracy than the more complicated encoder-decoder model, and confirm that our proposed models can outperform the encoder-decoder model in code coverage when fuzzing the SUT.

Our paper presents a solution for generating complex test data with the help of deep neural networks. The word “complex” in this context means that the test data consist of various data types combined according to a specific format or grammar. This is the case in most real-world applications that accept a file as their main input. For example, PDF reader software must handle PDF files as input, and PDF is one of the most complex input formats. A PDF file contains both textual and non-textual (binary) data, plus many human-defined rules that arrange these data fields into a file. To handle a complex file, an application processes it in several stages. These stages usually begin with parsing the input file, continue with semantic analysis, and end with executing the file content. Test data, here a complex input file, can reach high code coverage and reveal more of the existing bugs only if they pass all of these stages. To the best of our knowledge, existing work on fuzzing, one of the most effective software testing techniques, shows that randomly generating such test data leads to very low code coverage and hence cannot guarantee the absence of bugs or the reliability of the software. On the other hand, generating test data from a grammar requires extracting the grammar or model of the file manually, which is expensive and time-consuming.

Challenges and Solutions

We address the problem of generating complex test data that successfully pass the different stages of file processing by using machine learning techniques to learn the structure of a given input format and then generate new test data from the learned model. A file can be seen as a sequence of bytes generated by the grammar of its format. Hence, we can use a language model to learn this grammar automatically from a corpus containing various samples of the format. Neural language models are effective at learning the properties of natural language and have been used successfully in complex natural language processing (NLP) tasks such as machine translation and image captioning. We apply a deep neural language model to learn the grammar of the file. The learned model is then sampled to generate new files as test data.

The first problem we encountered was finding a mechanism to distinguish between data and metadata. To do so, we applied a simple trick: because metadata is repeated in almost every sample of a file format, the learned model predicts metadata with a higher probability than data. By putting a threshold on the model output, data and metadata become distinguishable.
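The thresholding idea can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `split_data_metadata`, the threshold value 0.9, and the probabilities below are all made up for the example and are not taken from IUST-DeepFuzz.

```python
def split_data_metadata(predictions, threshold=0.9):
    """Label each generated symbol by the probability the model assigned to it.

    predictions: list of (symbol, probability) pairs, where probability is the
    confidence of a (hypothetical) learned model in the symbol it emitted.
    Symbols predicted with probability >= threshold are treated as metadata.
    """
    labels = []
    for symbol, prob in predictions:
        kind = "metadata" if prob >= threshold else "data"
        labels.append((symbol, kind))
    return labels


# Made-up model outputs for a PDF-like fragment "/Length 42"
predictions = [("/Length", 0.98), (" ", 0.95), ("4", 0.31), ("2", 0.22)]
labels = split_data_metadata(predictions)
print(labels)
```

The keyword and separator, which occur in nearly every sample, fall above the threshold, while the numeric value, which varies from file to file, falls below it.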

Since the aim is to find bugs through fuzzing, the second problem was how to determine which bytes should be fuzzed to reveal failures in the software under test (SUT). This can be done by targeting the different stages of input processing that the SUT uses. For example, to fuzz the parser, we should fuzz metadata, because the parser usually relies on metadata to validate the format of the input file and to extract data. On the other hand, to fuzz the execution stage, it may be better to fuzz data. The ability of the learned model to distinguish data from metadata is used to determine where to fuzz in the file.

In addition to determining where to fuzz, we must choose the value to put there, which is the third problem we addressed. The goal of fuzzing is to create malformed input, so the most inappropriate byte should be placed at the fuzz location. Again, the most inappropriate byte can be determined by the learned neural language model: it is enough to select the byte with the lowest likelihood instead of the highest likelihood, which is used by default.
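Choosing the fuzzed value amounts to taking the minimum instead of the maximum of the predicted distribution. The sketch below assumes the model output is available as a dictionary from candidate symbols to probabilities; the function names and the distribution are hypothetical.

```python
def pick_normal_symbol(distribution):
    """Default generation: the most likely next symbol."""
    return max(distribution, key=distribution.get)


def pick_fuzz_symbol(distribution):
    """Fuzzing: the least likely next symbol, i.e. the most 'inappropriate' one."""
    return min(distribution, key=distribution.get)


# Made-up next-symbol distribution produced by a (hypothetical) learned model
next_symbol_probs = {"e": 0.70, "o": 0.20, "x": 0.08, "\x00": 0.02}
normal = pick_normal_symbol(next_symbol_probs)
fuzzed = pick_fuzz_symbol(next_symbol_probs)
print(repr(normal), repr(fuzzed))
```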

The fourth problem that arose was training the neural language model on the ASCII character set rather than on all bytes. To deal with the non-ASCII bytes that make up the non-textual parts of the input, we replaced these parts with a specific token, called the Binary Token, and asked the model to learn that token. At generation time, whenever the model predicts the binary token, we replace it with a real binary section previously removed from a file. This is a simple but effective way to achieve hybrid test data generation. Two specific fuzzing algorithms, MetadataNeuralFuzz and DataNeuralFuzz, are proposed based on the learned generative model. The former targets the parsing stage, and the latter focuses on the rendering or execution stage of input file processing. We believe that both algorithms are required for complete fuzz testing with high code coverage and, probably, a high number of discovered bugs. We investigate the effectiveness of various language models with different configurations and several sampling strategies in the context of complex test data generation. We also study various parameters required when generating and fuzzing test data with deep learning techniques.

Tools and Publications

To turn our theory into a practical tool, we designed and implemented IUST-DeepFuzz as a modular file format fuzzer. The main module of IUST-DeepFuzz is a test data generator that implements our neural fuzz algorithms. The fuzzer injects test data into the SUT and checks for unexpected results such as crashes or memory corruption in the SUT. IUST-DeepFuzz uses Microsoft Application Verifier, a free runtime monitoring tool, as a monitoring module to catch any memory corruption. It also uses VSPerfMon, another tool from Microsoft, to measure code coverage. The modules are connected using small Python and batch scripts. IUST-DeepFuzz can find both the place and the value of the fuzzed symbol automatically while generating the input. Other file formats such as HTML, CSS, XML, JSON, and all types of source code can be produced in the same manner, which makes the approach suitable for fuzzing and quick quality assurance of any software system.

For more information about both the theoretical and practical aspects of IUST-DeepFuzz, refer to the IUST-DeepFuzz website.

Dynamic Complex Network

2020-03-04T21:12:00+03:30

Course Description

This course provides an in-depth exploration of complex dynamic networks, focusing on their structure, behavior, and applications across various domains. Students will study the mathematical foundations, analysis techniques, and real-world implications of network dynamics. By examining interconnected systems, ranging from social networks to biological and technological networks, students will gain the tools necessary to model, analyze, and optimize network-based systems in the real world.

Course Objectives

By the end of this course, students will:

1. Understand the fundamental concepts and theories of network science.
2. Analyze and characterize the dynamics of complex networks, including robustness, scalability, and efficiency.
3. Apply mathematical and computational models to study network behavior and emergent phenomena.
4. Explore the role of networks in diverse applications, including communication systems, social systems, biology, and finance.
5. Develop practical skills in modeling and simulating dynamic networks using advanced tools and frameworks.

Syllabus

Week 1-2: Fundamentals of Complex Networks

Introduction to network science
Graph theory basics: nodes, edges, adjacency matrices
Types of networks: random, small-world, scale-free, and multiplex

Week 3-4: Network Structure and Properties

Key network metrics: degree distribution, centrality, clustering coefficient
Community detection and modularity
Real-world network structures and their implications

Week 5-6: Dynamic Processes on Networks

Diffusion and spreading processes (e.g., information, diseases)
Synchronization and collective behavior
Contagion models in social and biological networks

Week 7-8: Robustness and Resilience of Networks

Vulnerability analysis and fault tolerance
Cascading failures in critical infrastructure networks
Designing robust and resilient networked systems

Week 9-10: Network Optimization and Control

Controllability of complex networks
Optimizing network flows and resource allocation
Applications in transportation and communication systems

Week 11-12: Advanced Topics in Network Science

Temporal networks and evolving structures
Multilayer and interconnected networks
Data-driven approaches to network analysis

Week 13: Applications of Complex Dynamic Networks

Social networks and influence propagation
Biological and ecological networks
Applications in finance, power grids, and smart cities

Week 14: Capstone Project and Review

Designing, simulating, and analyzing a complex network for a chosen domain
Final presentations and peer reviews

Course Assessment

Assignments (25%): Analytical exercises and computational tasks.
Paper-based Exam (40%): Theoretical understanding of network concepts.
Capstone Project (25%): Team-based project involving real-world network analysis.
Participation (10%): Active engagement in discussions, workshops, and reviews.

Resources

Textbooks:
Networks: An Introduction by Mark Newman
Dynamical Processes on Complex Networks by Alain Barrat et al.
Online Platforms:
Software tools such as Gephi, NetworkX (Python), and Cytoscape for network modeling and analysis
Online datasets for real-world network examples
Research Papers:
Recent publications on network science and dynamic systems from top journals

Prerequisites

A basic understanding of graph theory and linear algebra.
Familiarity with programming and data analysis tools.

Contact Information

For inquiries, feel free to reach out via my webpage: www.m-zakeri.github.io.

Course history

Teaching assistant

I was a teaching assistant for the Dynamic Complex Network M.Sc. and Ph.D. course taught by Dr. Hossein Rahmani for one semester (winter and spring 2020) at Iran University of Science and Technology. Our teaching materials during these two years are available to view and download.

Useful links

Game Theory

2020-03-04T21:12:00+03:30

The course discusses the fundamentals of game theory. Game theory is the study of mathematical models of strategic interactions among rational agents. It has applications in all fields of social science, as well as in logic, economics, systems science, and computer science. Originally, it addressed two-person zero-sum games, in which each participant's gains or losses are exactly balanced by those of the other participants. Game theory has recently found many applications in formulating network resource allocation problems and in coordinating the behavior of network entities to achieve a stable operating point with a global consensus property.

Teaching assistant

Foreword

I was a teaching assistant for the Game Theory M.Sc. and Ph.D. course taught by Dr. Vesal Hakami for one semester (winter and spring 2020) at Iran University of Science and Technology. Our teaching materials during these two years are available here to view and download.

Useful links

WordPress for beginning

2019-04-02T02:00:00+04:30

Welcome to WordPress for Beginning, our essential WordPress training for beginners. WordPress is a free and open-source content management system (CMS) based on the PHP programming language and the MySQL database. Its features include a plugin architecture and a template system. It is most associated with blogging but supports other types of web content, including more traditional mailing lists and forums, media galleries, and online stores. It is used by more than 60 million websites, including 30.6% of the top 10 million websites, which makes it a popular CMS and web framework. The following video tutorials help you start building your own website with WordPress. The tutorials are in Persian.

Video tutorials

Part 0: Introduction and syllabus

Part 1: Internet and world wide web (www)

Part 2: Content management systems (CMSs)

Part 3: Installing and configuring WordPress on cPanel

Coming soon

Part 4: Working with WordPress admin panel (dashboard)

Coming soon

Part 5: Plugin management

Coming soon

Part 6: Media management

Coming soon

Part 7: User management

Children and programming

2019-03-10T21:13:00+03:30

"Children are a free resource of positive energy."

I love teaching computer programming to children. For many obvious reasons, I strongly believe that the 21st century is the century of CODE. But where should we start? What can we do for our 21st-century kids?

There are many ways to teach programming to a kid. I began with the Scratch programming tool to teach some children the basics of computer programming. Typically, children love computer games and enjoy playing them. However, I found that they enjoy learning how computer games are made even more.

Scratch is a block-based visual programming language targeted primarily at children. Scratch is used as the introductory language because creating interesting programs is relatively easy, and the skills learned can be applied to other programming languages such as Python and Java. Although Scratch’s main user age group is 8–18 years of age, Scratch has also been created for educators and parents.

You can see the results of teaching programming to my nephew, Yasin, in these video tutorials, which he prepared himself.

A survey of sequence-to-sequence learning with neural networks

2019-02-22T12:30:00+03:30

School of Computer Engineering

Sequence-to-Sequence Learning with Neural Networks

Morteza Zakeri (M - Z A K E R I [ A T ] L I V E [ D O T ] C O M)

Implementation + dataset - ZIP file (18.50 MB), Bahman 1396
Phase 3 (final) - PDF version (1.80 MB), Bahman 1396
Presentation (slides (1.92 MB) | video (53.50 MB)), Dey 1396
Phase 2 - PDF version (1.75 MB), Azar 1396
Phase 1 - PDF version (1.23 MB), Aban 1396
Figures - ZIP file (2.22 MB)
Main reference (165 KB)
Last updated: 19 - 11 - 1396

Abstract

Deep learning is a relatively new branch of machine learning in which computational functions, in the form of multi-level or deep graphs, are used to identify and estimate the rule governing the solution of a complex problem. Deep neural networks are a tool for designing and implementing this learning model. These networks have succeeded in many difficult machine learning tasks. To use deep networks in tasks where the order of the input data matters, as in most natural language processing tasks, recurrent neural networks were devised, which provide a suitable representation of language models. In their simple form, these models are not suitable for every language modeling task. This report examines a particular recurrent network model, called the sequence-to-sequence or encoder-decoder model, which was developed for tasks whose input and output sequences have different lengths, such as machine translation, and which has produced acceptable results in this area.

Keywords: sequence-to-sequence model, recurrent neural network, deep learning, machine translation.

Introduction

Models and learning methods based on deep neural networks (DNNs)¹ have recently attracted a great deal of attention, owing to the growing computational power of hardware and to the resolution of some fundamental challenges in training these networks. DNNs have proved extremely powerful in hard machine learning tasks such as speech recognition and object recognition, and in some cases have completely displaced traditional methods. The high representational power of DNNs comes from their ability to perform many computations in parallel across several layers and, with a large number of parameters, to estimate the answer to a given problem and provide a suitable model of it. Large DNNs can currently be trained in a supervised³ manner with the backpropagation² algorithm on a labeled and sufficiently large training set. Therefore, when the rule governing a problem has a very large number of parameters and an optimal setting of these parameters exists (inferred merely from the fact that the human brain solves the same problem very quickly), backpropagation learning finds this parameter setting (the optimal values) and solves the problem [1]. Many machine learning tasks belong to natural language processing (NLP)⁴, where the order of the inputs and outputs of a problem usually matters. For example, in machine translation, two sentences with the same words in a different order have different meanings (outputs). Such tasks are called sequence-based⁵; in effect, their input is a sequence. Deep feedforward neural networks⁶ do not perform well on this class of tasks, because they have no built-in capability to remember and model order.

Recurrent neural networks (RNNs)⁷ are a family of neural networks for processing sequence-based tasks. Just as convolutional neural networks (CNNs)⁸ are designed specifically for processing a grid⁹ of values, such as an image, an RNN is built for processing a sequence of input values $$ x = \langle x^{(1)}, x^{(2)}, \ldots, x^{(n)} \rangle $$ [2]. In most tasks, the output of an RNN is also a sequence, like its input. This sequence-processing capability makes RNNs very well suited to NLP tasks.

Problem Statement and Significance

Despite the flexibility and power of RNNs, in their simple form these networks map a fixed-length input sequence to an output sequence of the same length. This is a serious limitation, because many important problems are best expressed with sequences whose lengths are not known in advance, and assuming a predetermined fixed length for the input and output does not model such problems well. Machine translation (MT)¹⁰ and speech recognition¹¹ are problems of this kind. A question answering system can likewise be viewed as a mapping from one sequence of words, the question, to another sequence of words, the answer. It is therefore clear that a domain-independent method for learning sequence-to-sequence mappings would be useful and well justified [1].

Goals and Approach

As we saw, a wide range of NLP tasks rests on mapping sequences of unknown and variable length to one another. Traditional methods such as n-gram models have their own limitations on this class of problems, and deep learning methods have been clearly promising. The goal is therefore to present an RNN-based model for sequence-to-sequence mapping. This report describes in detail the approach proposed in [1] and its results. Sutskever et al. [1] showed how a straightforward application of a network with the long short-term memory (LSTM)¹² architecture can solve sequence-to-sequence mapping problems. The main idea is to use one LSTM to read the input sequence, one element per time step, to obtain a large fixed-dimensional vector, and then to use another LSTM to extract the output sequence from that vector. The second LSTM is essentially an RNN-based language model, except that it is also conditioned on the input sequence. The ability of the LSTM to successfully learn the long-range dependencies hidden within sequences makes it suitable for the proposed model. Figure (1) shows a general schematic of this model.

Data and Results

The model proposed in the previous section was evaluated on a neural machine translation (NMT)¹³ task. The experiments used the WMT’14 English-to-French translation dataset [3]. A smaller dataset is also available in [4], which is suitable for training experimental, non-production models. This collection also includes English-to-Persian translations. The results of this work are as follows. On the WMT’14 dataset, by directly extracting translations from five deep LSTMs with 380 million parameters, a BLEU score of 34.81 was ultimately obtained. This was the highest score achieved through NMT up to the time the paper was published. For comparison, the BLEU score of statistical machine translation (SMT)¹⁴ on the same dataset is 33.30. Moreover, the score of 34.81 was obtained with a vocabulary of 80 thousand words, and the score was penalized whenever a word in the reference translation was not in the vocabulary. The results therefore show that a largely unoptimized neural network architecture, with much room for improvement, can beat the traditional phrase-based SMT system [1].

Preliminaries

This section briefly explains the three main concepts of this report: the language model (LM)¹⁵, recurrent neural networks, and neural machine translation.

Language Model

A language model is a basic concept in NLP that makes it possible to predict the next token in a sequence. More precisely, an LM is a probability distribution over a sequence of tokens (usually words) that specifies the probability of a given sequence occurring. As a result, among several given sequences, for example several sentences, the most probable one can be chosen [5]. The LM for a sequence $$ x = \langle x^{(1)}, x^{(2)}, \ldots, x^{(n)} \rangle $$ is:

$$ p(x) = \prod_{t=1}^{n} p(x^{(t)} \mid x^{(1)}, \ldots, x^{(t-1)}) \qquad (1) $$

To overcome computational challenges, traditional n-gram models use the Markov assumption to restrict Equation (1) to only the n-1 previous tokens. For this reason, they are not suitable for long sequences (more than 4 or 5 tokens) or for unseen sequences. Neural language models (NLMs)¹⁶, which predict the next word with neural networks, were at first combined with n-grams to assist them, which introduced considerable complexity while the problem of long sequences remained [5]. Recently, however, new LM architectures based entirely on DNNs have been created. The cornerstone of these architectures is the RNN, which is introduced in the next section.
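The chain rule with a Markov assumption can be made concrete with a toy bigram model (n = 2), in which each token is conditioned only on the one before it. This is a minimal sketch on a made-up three-sentence corpus; the function names, the `<s>` start symbol, and the unsmoothed maximum-likelihood estimates are assumptions of the example, not part of the report.

```python
from collections import Counter


def train_bigram(corpus):
    """Count contexts and bigrams over tokenized sentences padded with <s>."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def sequence_probability(sentence, context_counts, bigram_counts):
    """p(x) as a product of p(x_t | x_{t-1}), estimated by relative frequency."""
    prob = 1.0
    tokens = ["<s>"] + sentence
    for prev, cur in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigram_counts[(prev, cur)] / context_counts[prev]
    return prob


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
ctx, big = train_bigram(corpus)
p_seen = sequence_probability(["the", "cat", "sat"], ctx, big)
p_unseen = sequence_probability(["cat", "the", "sat"], ctx, big)
print(p_seen, p_unseen)
```

The seen sentence gets probability 1 × 2/3 × 1/2 = 1/3, while the reordered sentence gets probability 0, which illustrates why unsmoothed n-gram models handle unseen sequences poorly.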

Recurrent Neural Networks

Recurrent neural networks are a class of neural networks that are expressed as a directed cyclic graph. In other words, the input of each hidden or output layer includes, in addition to the output of the previous layer, feedback input from the previous step. Figure (2) shows an RNN. As can be seen, the hidden layer also receives feedback from earlier steps. At each time step t (from t=1 to t=n), a vector $$ x^{(t)} $$ of the input sequence $$ x = \langle x^{(1)}, x^{(2)}, \ldots, x^{(n)} \rangle $$ is processed. In general, the update (forward pass¹⁷) equations of an RNN at step t are [2]:

$$ a^{(t)} = b + W h^{(t-1)} + U x^{(t)} \qquad (2) $$
$$ h^{(t)} = \Phi(a^{(t)}) \qquad (3) $$
$$ o^{(t)} = c + V h^{(t)} \qquad (4) $$
$$ \hat{y}^{(t)} = \mathrm{softmax}(o^{(t)}) \qquad (5) $$

Here the vectors b and c are the biases, and the matrices U, V, and W are the weights of the input-to-hidden, hidden-to-output, and hidden-to-hidden edges, respectively; together they form the set of network parameters. Φ is the activation function, which is usually chosen to be ReLU¹⁸ or the sigmoid¹⁹. The last layer is a softmax²⁰ function, which gives the probability of each output token. Figure (2) shows an RNN with one hidden layer, but a deep RNN with several hidden layers is also possible. The lengths of the input and output sequences can also differ, depending on the problem at hand. Karpathy [6] divides RNNs into several categories according to the length of the input sequence and the length of the output sequence. Figure (3) shows this categorization. Karpathy's figure of the different RNN configurations was published after the paper selected for this report; nevertheless, we will see in Section 4 how combining these schemes can also inspire the idea of the sequence-to-sequence architecture.
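The forward pass of a single-hidden-layer RNN can be sketched in pure Python. The sketch assumes the usual update a = b + W h_prev + U x, h = Φ(a), o = c + V h, followed by a softmax, with U, V, and W playing the roles described above. The weights, the zero initial hidden state, and the choice of the sigmoid for Φ are arbitrary assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(o):
    shift = max(o)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in o]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_forward(xs, U, W, V, b, c, phi=sigmoid):
    """Run the RNN over the input vectors xs; return per-step outputs and last h."""
    h = [0.0] * len(b)  # assumed initial hidden state
    outputs = []
    for x in xs:
        a = [bi + wh + ux for bi, wh, ux in zip(b, matvec(W, h), matvec(U, x))]
        h = [phi(ai) for ai in a]
        o = [ci + vh for ci, vh in zip(c, matvec(V, h))]
        outputs.append(softmax(o))
    return outputs, h


# Toy sizes: 2 inputs, 2 hidden units, 3 output tokens (arbitrary weights)
U = [[0.5, -0.3], [0.1, 0.8]]
W = [[0.2, 0.0], [-0.4, 0.6]]
V = [[1.0, -1.0], [0.5, 0.5], [-0.7, 0.2]]
b = [0.0, 0.1]
c = [0.0, 0.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, h_last = rnn_forward(xs, U, W, V, b, c)
for t, y in enumerate(outputs, start=1):
    print(t, [round(p, 3) for p in y])
```

Each step yields a probability distribution over the three output tokens, and the hidden state carries information from earlier steps to later ones.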

Neural Machine Translation

In general, MT can be modeled with an LM that is conditioned on the source-language sentence. Accordingly, NMT can be regarded as a recurrent language model that directly models the conditional probability p(y|x) of translating a source-language sentence $$ x = \langle x^{(1)}, x^{(2)}, \ldots, x^{(n)} \rangle $$ into a target-language sentence $$ y = \langle y^{(1)}, y^{(2)}, \ldots, y^{(m)} \rangle $$. Note that the length n of the source sentence and the length m of the target sentence are not necessarily equal. The goal in NMT is therefore to compute this probability and then use it to generate the target-language sentence, both with the help of DNNs [5].

Related Work

Much work has been done on NLMs. Most of it uses feedforward or recurrent neural networks, typically applied to an MT task by rescoring the n-best list²¹, and the results usually improve on earlier scores [1]. More recently, work has been done on incorporating source-language information into the NLM. For example, Auli et al. [7] combined an NLM with a topic model²² of the input sentence, which improved the results. The work in [1] is closely related to [8]. In [8], the authors were the first to compress the input sequence into a vector and then convert it into the output sequence. However, that work uses CNNs to map the sequence to a vector, which does not preserve the order of the words. Cho et al. [9] used an LSTM-like architecture to map the input sequence to a vector, then extract the output sequence, and finally combine the result with SMT. Their architecture consists of two RNNs, called the encoder and the decoder: the first RNN converts a variable-length sequence into a fixed-length vector in the form of a context cell c, and the second RNN generates the output sequence given c and the start-of-sentence symbol of the target sentence. Their proposed architecture, under the general name RNN encoder-decoder, is shown in Figure (4). Because they did not use LSTM and focused mostly on combining this method with earlier SMT models, the problem of memory loss remains for long input and output sequences. Bahdanau et al. [10] proposed a direct translation method based on a neural network that uses an attention mechanism to overcome the poor performance of [9] on long sentences, and they achieved good results.

The Sequence-to-Sequence Model

The sequence-to-sequence model uses two RNNs with LSTM units. The goal of the LSTM here is to estimate the conditional probability $$ p(\langle y^{(1)}, \ldots, y^{(m)} \rangle \mid \langle x^{(1)}, \ldots, x^{(n)} \rangle) $$ introduced earlier (Section 2-3). The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation v of the input sequence $$ \langle x^{(1)}, \ldots, x^{(n)} \rangle $$ from the last hidden state, and then computing the probability of $$ \langle y^{(1)}, \ldots, y^{(m)} \rangle $$ with the standard LM formulation (Equation (1)), with the initial hidden state set to v:

$$ p(y^{(1)}, \ldots, y^{(m)} \mid x^{(1)}, \ldots, x^{(n)}) = \prod_{t=1}^{m} p(y^{(t)} \mid v, y^{(1)}, \ldots, y^{(t-1)}) \qquad (6) $$

In Equation (6), each distribution $$ p(y^{(t)} \mid v, y^{(1)}, \ldots, y^{(t-1)}) $$ is represented by a softmax over all words in the vocabulary. The LSTM formulation follows [11]. Each sentence must end with a special symbol such as EOS, which lets the model define a probability distribution over sequences of any length. The overall scheme is shown in Figure (1). In that figure, the LSTM computes the representation of the input sequence $$ \langle 'A', 'B', 'C', EOS \rangle $$ and then uses it to compute the probability of the output sequence $$ \langle 'W', 'X', 'Y', 'Z', EOS \rangle $$. The model can also be seen as a combination of the third and fourth panels (پ and ت) of Figure (3).

The model as implemented differs from the description above in three respects. First, two separate LSTMs are used, one for the input sequence and one for the output sequence, because this increases the number of model parameters substantially at negligible computational cost. Second, deep LSTMs significantly outperform shallow ones, so a four-layer LSTM is used. Third, the authors found that reversing the order of the input sequence has a striking effect on both the convergence speed of training and the prediction accuracy. So instead of mapping a, b, c to α, β, γ, the LSTM is trained to map c, b, a to α, β, γ, where α, β, γ is the translation (output) of a, b, c. The explanation offered is that with the reversed mapping the beginnings of the two sequences, which correspond to each other, end up close together, which helps stochastic gradient descent (SGD) converge sooner and approach the optimum [1].
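The input reversal described above amounts to a one-line preprocessing step. The following is a minimal sketch, assuming sentences are already tokenized into Python lists; the function name `reverse_source` is hypothetical and does not come from [1].

```python
def reverse_source(pairs):
    """Reverse the token order of each source sentence.

    Target sentences are copied unchanged, as in the scheme described above.
    `pairs` is a list of (source_tokens, target_tokens) tuples.
    """
    return [(list(reversed(src)), list(tgt)) for src, tgt in pairs]


# Toy example: a, b, c -> alpha, beta, gamma becomes c, b, a -> alpha, beta, gamma
pairs = [(["a", "b", "c"], ["α", "β", "γ"])]
reversed_pairs = reverse_source(pairs)
```

Only the training and test inputs change; the network architecture itself is untouched.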

Training the Network

Since its introduction by Sutskever et al. [1], the sequence-to-sequence model has been cited widely and has become a reference model in NMT. Luong's doctoral thesis [5] describes it in detail, along with several refinements. This section covers some details of how the model is trained. Figure (5) gives a more detailed view of the model in Figure (1). Training proceeds as follows. First, the target-language sentence is placed to the right of its corresponding source-language sentence. The ‘-‘ symbol plays the role of EOS here; it can mark either the end of the source sentence or the start of the target sentence, so it can be assigned to either group. The left LSTM, the encoder, reads one word of the source sentence at each time step, converts it to a suitable representation, and updates the internal state of the hidden layer. Once the last word has been processed, the hidden-layer values form the fixed-length vector v, which now represents the whole source sentence. The second LSTM, the decoder, then receives the first word of the target sentence together with v as input and makes its prediction. The true label for this input is the next word of the target sentence. After the prediction is compared with the label and the error is computed, backpropagation runs through both networks, starting from the decoder, and adjusts the parameters in the direction opposite to the gradient. This continues until the end of the target sentence. In practice, the input may be fed to the network as a batch²³, with the gradient computed over the whole batch. In other words, the decoder is trained to map the target sentence to the same sentence shifted one position ahead. This technique is called teacher forcing [2] and is appropriate when the target sentence (the output sequence) is fully known. In effect, the next word serves as the label in supervised training, and the weights are adjusted accordingly.
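The one-position shift used by teacher forcing can be illustrated without any neural network code. The sketch below is only an illustration: the function name and the `<s>` and `</s>` markers are assumptions made for this example, not symbols taken from [1] or [2].

```python
def teacher_forcing_pair(target_tokens, sos="<s>", eos="</s>"):
    """Build the decoder input and the decoder label sequence.

    The label at position t is the token the decoder should predict
    after having read the input at position t.
    """
    decoder_input = [sos] + list(target_tokens)
    decoder_target = list(target_tokens) + [eos]
    return decoder_input, decoder_target


dec_in, dec_out = teacher_forcing_pair(["W", "X", "Y", "Z"])
```

At every position the decoder sees the correct previous word rather than its own earlier prediction, which is what distinguishes training from inference.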

At inference²⁷ time, that is, when we want to decode an unknown target-language sentence (output sequence), the process described above is carried out with slight differences, in the following steps:
1. The input sequence is converted to a context vector by the encoder. With LSTM cells, the context vector holds two state variables for each layer of the network; with GRU cells, it holds one.
2. A sequence of length 1, initially containing the target-language start-of-sentence token, is placed at the decoder input.
3. The context vector from step 1 and the sequence from step 2 are fed to the decoder to predict the next token (here, a word) of the target sentence.
4. The next word is selected from the prediction of step 3, either greedily or by beam search (described later).
5. The word selected in step 4 is appended to the target sentence (output sequence).
6. The word selected in step 4 replaces the start-of-sentence token as the decoder input, and steps 3 to 5 are repeated until the end-of-sentence token is produced or the generated sentence exceeds a preset length.
Note also that the input sequences at this stage are drawn from the test set; inference is run on the test data in order to evaluate the model.

Training Details

In [1], a deep LSTM with four layers and 1000 memory cells per layer is used. The input vocabulary has 160,000 words and the output vocabulary has 80,000 words. The resulting LSTM has 384 million parameters in total, 64 million of which are recurrent connections. Further parameter and training details are:

+ Parameters are initialized with random values drawn from the uniform distribution over [-0.08, 0.08].
+ Training uses standard SGD with a learning rate of 0.7. After five epochs²⁴, the learning rate is halved every half epoch. Training runs for 7.5 epochs in total.
+ The gradient is computed on batches of 128 sequences and divided by the batch size, 128.
+ Although LSTMs tend not to suffer from vanishing gradients²⁵, they can have exploding gradients²⁶. A hard constraint is therefore placed on the gradient norm: whenever the norm exceeds a threshold, the gradient is rescaled. For each training batch, $$ s = \|g\|_2 $$ is computed, where g is the gradient after division by 128. If s > 5, set $$ g = \frac{5g}{s} $$.
+ Sentences have different lengths. Most are short (length 20 to 30), but some are long (length over 100), so a randomly chosen batch of 128 sentences contains many short sentences and a few long ones, and much of the computation within the batch is wasted. To address this, all sentences in a batch are chosen to have roughly the same length, which speeds up computation by up to a factor of 2.
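The norm constraint and the learning-rate schedule listed above are simple enough to write out. The following pure-Python sketch treats the gradient as a flat list of floats. The function names are hypothetical, and the schedule assumes that the first halving takes effect exactly at epoch 5.0, which the description leaves open.

```python
import math


def clip_gradient(g, threshold=5.0):
    """Rescale g so that its L2 norm does not exceed threshold.

    Implements: s = ||g||_2; if s > threshold then g = threshold * g / s.
    """
    s = math.sqrt(sum(x * x for x in g))
    if s > threshold:
        return [threshold * x / s for x in g]
    return list(g)


def learning_rate(epoch, base=0.7, start=5.0, step=0.5):
    """Learning rate at a (possibly fractional) epoch.

    Constant before `start`, then halved once per `step` epochs.
    Assumption: the first halving is applied at epoch == start.
    """
    if epoch < start:
        return base
    halvings = int((epoch - start) / step) + 1
    return base / (2 ** halvings)
```

For instance, `clip_gradient([6.0, 8.0])` has norm 10 and is scaled down to norm 5, while a gradient whose norm is already at most 5 is returned unchanged.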

Experiments

The sequence-to-sequence learning method was evaluated on English-to-French machine translation in two settings. In the first, the model translates English sentences into French directly; in the second, it rescores the n-best lists of an SMT system. This section reports the results as translation scores, sample translations, and finally a visualization of the input-sentence representations.

Implementation

The original model is implemented in C++. With the deep LSTM configuration described in Section 4-1-2, this implementation processes about 1,700 words per second on a single GPU, which is too slow for a dataset as large as WMT. The model is therefore parallelized across 8 GPUs. Each LSTM layer runs on its own GPU and passes its activations to the next GPU (layer) as soon as they are computed. Since the model has four layers, the remaining four GPUs are used to parallelize the softmax, so each of them is responsible for one matrix multiplication (with a matrix of size 1000 × 20000). With this GPU-level parallelization, the processing speed reaches 6,300 words per second. Training with this implementation took about 10 days [1]. Besides the original implementation, others exist in various languages and frameworks, including two good Python implementations on TensorFlow and Keras. The TensorFlow implementation also adds newer mechanisms such as attention [12]. The Keras implementation works at the character level rather than the word level [13]. Although all of these implementations use machine translation as the task, the model is general and applies to any task that maps an input sequence to an output sequence of a different length.

Dataset Details

As mentioned earlier (Section 3-1), the experiments use the WMT’14 English-to-French dataset [3]. The model is trained on a subset of 12 million sentences containing 348 million French words and 304 million English words. Machine translation, and this dataset in particular, were chosen because a tokenized²⁸ training set and test set are publicly available for training and evaluation; the sequence-to-sequence model itself is not tied to a specific task. Since typical neural language models rely on a vector representation for each word, a fixed-size vocabulary is used for both languages: the 160,000 most frequent words of the source language (English) and the 80,000 most frequent words of the target language (French). Every word outside these vocabularies is replaced with the special token “UNK”. The implementation in [12] uses the WMT’16 German-English dataset [14], and the sample model in [13] uses the smaller dataset available in [4], which can also be swapped for the datasets above. The main drawback of the character-level implementation [13] is that machine translation usually aligns words rather than characters, so this model does not reach the accuracy of word-level models. It is nevertheless a useful starting point for other sequence-to-sequence tasks such as text generation.
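The fixed-size vocabulary and the “UNK” replacement can be sketched in a few lines. The helper names below are hypothetical, and the three-sentence corpus is a toy stand-in for the real training data.

```python
from collections import Counter


def build_vocab(sentences, max_size):
    """Return the set of the max_size most frequent tokens."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}


def replace_oov(sentence, vocab, unk="UNK"):
    """Replace every out-of-vocabulary token with the unk symbol."""
    return [tok if tok in vocab else unk for tok in sentence]


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = build_vocab(corpus, max_size=3)          # keeps "the", "cat", "sat"
mapped = replace_oov(["the", "dog", "ran"], vocab)
```

In the setting described above, `max_size` would be 160,000 for the source language and 80,000 for the target language.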

Decoding and Rescoring

The core of the experiments in [1] is training a large deep LSTM on many pairs of source-language and target-language sentences. Training maximizes the log probability of a correct translation T given the source sentence S, so the training objective is $$ \frac{1}{|\mathcal{S}|} \sum_{(T,S) \in \mathcal{S}} \log p(T \mid S) $$ where $$ \mathcal{S} $$ is the training set. Once training is complete, translations are produced by finding the most likely translation according to the LSTM: $$ \hat{T} = \arg\max_{T} p(T \mid S) $$ The most likely translation is searched for with a simple left-to-right beam search²⁹ decoder that maintains B partial hypotheses³⁰, where a partial hypothesis is a prefix of some translation. At each time step, every partial hypothesis in the beam is extended with the possible words from the vocabulary. This quickly increases the number of hypotheses, so all but the B most likely ones, according to the model's log probability, are discarded. As soon as the “EOS” token is appended to a hypothesis, it is removed from the beam and added to the set of complete hypotheses. Although this decoder is approximate, it is simple to implement. The proposed system performs well even with a beam size of 1, and a beam size of 2 provides most of the benefits of beam search. The BLEU scores obtained in the experiments are listed in Table (1).
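The beam search procedure can be sketched independently of the LSTM. In the following illustration, `next_log_probs` stands in for the trained model: it is a hypothetical callable that maps a prefix to a dictionary of next-token log probabilities, and `toy_model` is a made-up distribution chosen so that greedy search (B = 1) and B = 2 give different answers.

```python
import math


def beam_search(next_log_probs, beam_size, eos="EOS", max_len=20):
    """Left-to-right beam search that keeps at most beam_size partial hypotheses."""
    beam = [(0.0, [])]        # (log probability, token prefix)
    complete = []             # hypotheses that already end with eos
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((score + logp, prefix + [token]))
        # Keep only the beam_size most likely extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, hyp in candidates[:beam_size]:
            if hyp[-1] == eos:
                complete.append((score, hyp))   # leaves the beam
            else:
                beam.append((score, hyp))
        if not beam:
            break
    pool = complete if complete else beam
    return max(pool, key=lambda c: c[0])[1]


def toy_model(prefix):
    """Made-up next-token distribution used only for this example."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"EOS": 0.55, "c": 0.45},
        ("b",): {"EOS": 0.9, "c": 0.1},
    }
    probs = table.get(tuple(prefix), {"EOS": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy = beam_search(toy_model, beam_size=1)
wider = beam_search(toy_model, beam_size=2)
```

With this toy distribution, the greedy search commits to "a" (probability 0.6) and ends with total probability 0.33, while a beam of size 2 also keeps "b" alive and finds the better hypothesis "b EOS" with probability 0.36.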

Reversing the Source Sentences

While the LSTM can handle problems with long-term dependencies, the researchers in [1] found that it learns better when the source sentences are reversed before being fed to the encoder. Note that the target sentences are not reversed. With this simple change, the model's perplexity³¹ drops from 5.8 to 4.7, and the BLEU score of its decoded translations rises from 25.9 to 30.6. The authors of [1] do not have a complete explanation for this effect. Their initial explanation is that reversing the source sentences introduces many short-term dependencies into the dataset. Normally, when a source sentence is concatenated with its target sentence, each source word is far from its corresponding target word, so the problem has a large minimal time lag³² [1]. Reversing the source words does not change the average distance between corresponding source and target words. However, the first few source words are now very close to the first few target words, so the minimal time lag is greatly reduced, and backpropagation has an easier time establishing the connection between source and target sentences. This in turn substantially improves the model's overall performance. Initially, the authors expected reversal to yield more confident predictions only for the early words of the target sentence and less confident predictions for the later ones. However, the LSTM trained on reversed source sentences did better on long sentences than the LSTM trained on the original order (see Section 1-6).
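The claim that reversal leaves the average distance unchanged while shrinking the minimal time lag can be checked with a small calculation. The sketch below makes simplifying assumptions that are not in [1]: source and target have the same length n, word i of the source corresponds to word i of the target, and no separator token sits between the two sentences.

```python
def time_lags(n, reverse_source):
    """Distance between each source word and its target word after concatenation."""
    lags = []
    for i in range(n):
        # Position of source word i in the concatenated sequence.
        src_pos = (n - 1 - i) if reverse_source else i
        # Position of the corresponding target word.
        tgt_pos = n + i
        lags.append(tgt_pos - src_pos)
    return lags


forward = time_lags(5, reverse_source=False)
backward = time_lags(5, reverse_source=True)
```

Under these assumptions every lag equals n in the original order, while the reversed order gives lags 1, 3, 5, and so on. Their mean is still n, but the smallest lag drops to 1.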

Evaluation of Results

To evaluate translation quality, the automatic BLEU score [16] is used, computed with the ready-made script multi-bleu.pl³³. The same scoring has been used in earlier comparable work [9], [10], so it is reliable and allows models to be compared; for example, this script yields 28.45 for [10]. Results are given in Tables (1) and (2). The best result comes from an ensemble of LSTMs that differ in their random initialization and in the random order of minibatches. Although the decoding procedure used here (beam search) is simple and weak, according to [1] this is the first time a pure neural machine translation system has outperformed a phrase-based machine translation system by a sizeable margin. The system also cannot handle out-of-vocabulary words: as noted earlier, every word outside the vocabulary is replaced with “UNK”. Adding a mechanism for handling these words, or enlarging the vocabulary, should therefore improve performance further.

Table (1) Performance of the LSTM on the WMT’14 English-to-French test set (ntst14). Note that an ensemble of five LSTMs with beam size 2 is cheaper than a single LSTM with beam size 12 [1].

Method	BLEU score (ntst14)
Bahdanau et al. [10]	28.45
Single forward LSTM, beam size 12	26.17
Single reversed LSTM, beam size 12	30.59
Ensemble of 5 reversed LSTMs, beam size 1	33.00
Ensemble of 2 reversed LSTMs, beam size 12	33.27
Ensemble of 5 reversed LSTMs, beam size 2	34.50
Ensemble of 5 reversed LSTMs, beam size 12	34.81

Table (2) Comparable methods that use neural networks together with traditional machine translation on the WMT’14 English-to-French dataset [1].

Method	BLEU score (ntst14)
State of the art [15]	37.00
Cho et al. [9]	34.54
Rescoring the 1000-best list with a single forward LSTM	35.61
Rescoring the 1000-best list with a single reversed LSTM	35.85
Rescoring the 1000-best list with an ensemble of 5 reversed LSTMs	36.50
Oracle rescoring of the 1000-best list	~45

Model Analysis

One attractive feature of the sequence-to-sequence model in [1] is its ability to turn a sequence of words into a fixed-dimensional vector. Figure (6) visualizes some of the representations learned during training. The figure uses sentences with the same words in different orders, and it clearly shows that the representations are sensitive to word order. The actual representations are higher-dimensional; PCA is used to project them onto two dimensions.

Performance on Long Sentences

The model's output on long sentences (measured in number of words) confirms that the LSTM performs very well on them. Figure (7) shows a quantitative comparison of the results, and Table (3) lists several long sentences with the translations the model produced for them.

Table (3) Three examples of long translations produced by the sequence-to-sequence model, compared with the reference translations. The reader can get a good sense of their accuracy using Google Translate [1].

Conclusion and Future Work

This report presented and discussed a deep learning model for learning to map sequences of inputs to sequences of outputs. It showed that a deep LSTM with a limited vocabulary can outperform a standard phrase-based machine translation system with an unlimited vocabulary. The success of this relatively simple approach on machine translation suggests that it should also do well on other sequence-based tasks, provided enough training data is available. Training also revealed that reversing the source sequence improves the model's accuracy and performance. One can conclude that finding a way to introduce short-term dependencies earlier makes training much easier, so even a standard RNN (not a sequence-to-sequence model) would likely train better this way. This has not been tested in practice, however, and remains a hypothesis. Another notable result is the LSTM's ability to correctly translate long sequences. It was initially expected that the LSTM would fail on long sentences because of its limited memory, as other researchers had reported poor LSTM performance in similar work. Even so, memory degradation persists on very long sentences in the reversed setting and can probably be improved. Finally, the satisfactory results show that a simple deep neural network model, with much room left for improvement and optimization, can outperform the most mature traditional machine translation systems. Future work can focus on improving the accuracy of the sequence-to-sequence model and extending it to learn long sequences better. These models are expected to largely replace traditional methods in the near future. The results also indicate that the approach can succeed on other sequence-to-sequence tasks, which opens the way to solving problems in other fields.

The model could be used for machine translation of long texts from English to Persian and vice versa. For this task, the effect of reversing the source sentence needs to be examined, because for right-to-left languages reversal seems likely to increase the minimal time lag and give worse results. The model can also be applied to other tasks such as question answering. Using it for content generation, and for completing historical texts and poems with missing or lost parts, looks interesting and valuable. Beyond new tasks, changes to the architecture itself are suggested for improving accuracy on these tasks: bidirectional, hybrid, and stateful RNNs in the encoder and decoder, deeper layers, different hyperparameters such as the learning rate, and an attention mechanism. Finally, when labeled data is scarce or the full output sequence is not available at once (as in online learning or reinforcement learning), using the inference-time procedure during training instead of teacher forcing appears to be a suitable approach.

References

[1] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” NIPS, pp. 1–9, 2014.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016.
[3] “ACL 2014 ninth workshop on statistical machine translation.” [Online]. Available: http://www.statmt.org/wmt14/medical-task/index.html. [Accessed: 13-Nov-2017].
[4] “Tab-delimited bilingual sentence pairs from the Tatoeba project (good for Anki and similar flashcard applications).” [Online]. Available: http://www.manythings.org/anki/. [Accessed: 13-Nov-2017].
[5] M. T. Luong, “Neural machine translation,” Stanford University, 2016.
[6] A. Karpathy, “Connecting images and natural language,” Stanford University, 2016.
[7] M. Auli, M. Galley, C. Quirk, and G. Zweig, “Joint language and translation modeling with recurrent neural networks,” EMNLP, no. October, pp. 1044–1054, 2013.
[8] N. Kalchbrenner and P. Blunsom, “Recurrent continuous translation models,” EMNLP, no. October, pp. 1700–1709, 2013.
[9] K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” 2014.
[10] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” pp. 1–15, 2014.
[11] A. Graves, “Generating sequences with recurrent neural networks,” pp. 1–43, 2013.
[12] M.-T. Luong, E. Brevdo, and R. Zhao, “Neural machine translation (seq2seq) tutorial,” https://github.com/tensorflow/nmt, 2017.
[13] “Sequence to sequence example in Keras (character-level),” 2017. [Online]. Available: https://github.com/fcholle/keras/blob/master/examples/lstm_seq2seq.py. [Accessed: 13-Nov-2017].
[14] “Index of /wmt16/translation-task.” [Online]. Available: http://data.statmt.org/wmt16/translation-task/. [Accessed: 04-Dec-2017].
[15] N. Durrani, B. Haddow, P. Koehn, and K. Heafield, “Edinburgh’s phrase-based machine translation systems for WMT-14,” Proc. Ninth Work. Stat. Mach. Transl., pp. 97–104, 2014.
[16] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: A method for automatic evaluation of machine translation,” … 40th Annu. Meet. …, no. July, pp. 311–318, 2002.

Appendix A: Implementing the Sequence-to-Sequence Model in keras

This section describes the code of the sequence-to-sequence model implemented in keras, and the changes made to it. The code and the dataset are available from the links at the beginning of the report. The implementation is character-level, that is, it performs machine translation character by character, although word-level models are the norm for machine translation. Starting at the character level is simpler, and the model can later be trained at the word level by adding an embedding layer. The training set is a text file in which each line holds an English phrase followed by its translation, separated by a \t character. The target sentence therefore starts with \t and ends with \n. The \t token is used to switch from the encoder to the decoder, and the \n token marks the end of the target sentence. We first read the input file line by line and build the input texts and the target texts from it. We then convert both to numeric arrays with one-hot encoding. The following snippet does this:

...
input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])
encoder_input_data = np.zeros((len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype='float32')
decoder_input_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
decoder_target_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens),dtype='float32')
# Build the one-hot vectors.
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
...

Note that during training, the decoder's input is the target-language phrase (the target text) and its output is the same phrase shifted one step ahead (the technique known as teacher forcing). The code above handles this as well, that is, it builds the decoder targets in this way. Next, the encoder LSTM and the decoder LSTM are defined. In keras, the LSTM class implements everything this type of network needs; it suffices to create an instance (object) of it. The instance is also callable: calling it with the input layer as argument connects that layer to the LSTM object. The encoder LSTM is therefore defined as follows:

...
encoder_inputs = Input(shape=(None, num_encoder_tokens))
#print(type(encoder_inputs))
encoder = LSTM(latent_dim, return_state=True)
#print(type(encoder))
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
...

When return_state=True is passed while constructing an LSTM object, the two internal states of the LSTM are returned in addition to the main output. In the code above these two states are named state_h and state_c. The encoder output is not used in the sequence-to-sequence model and is discarded. Instead, state_h and state_c serve as the initial state of the decoder LSTM, as follows:

...
encoder_states = [state_h, state_c]
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,initial_state = encoder_states)
...

We also place a softmax layer on the output sequence of the decoder LSTM to turn the outputs into valid probabilities:

...
decoder_dense = Dense(num_decoder_tokens, activation = 'softmax')
decoder_outputs = decoder_dense(decoder_outputs)
...

The layers of the model are now built. They must be connected as a graph to form a model with definite inputs and outputs. keras has two kinds of models. The first is the sequential model, which links a linear stack of layers; the resulting graph is linear, so it is not suitable for more complex models such as the sequence-to-sequence model. The second kind is built with the Keras functional API, which is meant for models with multiple inputs and outputs whose graph is not necessarily linear. That is the approach used here. After defining the individual layers (the code in the previous parts), we use the Model class and specify the model's final inputs and outputs:

...
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
...

The plot_model function can display the topology of the resulting model graphically:

...
plot_model(model, to_file = './modelpic/seq2seq_model_' + dt + '.png', show_shapes=True, show_layer_names=True)
...

To use this function, add the statement from keras.utils import plot_model at the beginning of the code. The result of running it is shown below:

Next, the loss function and the optimizer of the model are specified:

...
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
...

Finally, we train the model on the actual data:

...
# Run training
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
      batch_size=batch_size,
      epochs=epochs,
      validation_split=0.2)
...

After training comes testing. Here we want to give the model a source-language sentence and decode its translation in the target language. This requires models that are similar to, but separate from, the training model. We first define the encoder model so that it takes a source sentence as input and produces the initial states of the decoder as output:

...
## encoder model
encoder_model = Model(encoder_inputs, encoder_states)
...

The decoder model is defined so that its inputs are its own states and its own output from the previous step. At the first decoding step, these inputs are the output states of the encoder model and the target-language start token, \t. The following snippet defines the decoder model:

...
## decoder model
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs,[decoder_outputs] + decoder_states)
...

Note that these models are built from the same layers (the encoder and decoder LSTM objects) that were created for training, so the layer weights are shared between the models. That is, the weights come from training, and these models are used only to predict outputs from given inputs. The decode_sequence function below takes an input sequence as its argument and returns the corresponding output sequence:

...
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.
    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)
        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char
        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
                len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True
        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.
        # Update states
        states_value = [h, c]
    return decoded_sentence
...

The encoder and decoder models defined earlier and used by this function are shown, respectively, in the graphs below:

Note: the numbers to the left of each node in the graphs in this section are assigned sequentially by keras and carry no meaning.

Deepening the Network

Although the model described in this section fully uses the concepts of deep neural networks, it is not deep in the literal sense. In keras, a deep model is easy to build by stacking layers on top of each other. For example, to give the encoder of the model above two LSTM layers, define the first encoder layer (encoder_l1) so that it outputs a sequence, connect the input layer to it, and let the LSTM layer from the earlier code take this new layer as its input:

...
# Define an input sequence.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
# Define LSTM layer 1 and pass the above encoder input sequence to it.
# return_sequences must be True so that the next LSTM layer receives a full sequence;
# return_state is left off so that the call returns a single tensor.
encoder_l1 = LSTM(latent_dim, return_sequences=True)(encoder_inputs)
# Define LSTM layer 2 (encoder)
encoder = LSTM(latent_dim, return_state=True)
# Pass (connect) encoder_l1 to LSTM layer 2 (encoder)
encoder_outputs, state_h, state_c = encoder(encoder_l1)
...

The same must be done for the other layers of the network.

Converting the Model to a Word-Level Model

The model above works at the character level. Given sequences of integers in which each integer is the index of a particular word in a dictionary, the Embedding layer in keras can prepare the model to consume these integer tokens. The following snippet adds this capability:

...
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None,))
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
encoder_states = [state_h, state_c]
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x)
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# Compile & run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# Note that `decoder_target_data` needs to be one-hot encoded,
# rather than sequences of integers like `decoder_input_data`!
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size, epochs=epochs, validation_split=0.2)
...
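The snippet above expects integer word indices as inputs and one-hot vectors as targets. A minimal sketch of how such data could be prepared is given below. It assumes whitespace tokenization and the usual teacher-forcing convention in which the target is the decoder input shifted one step to the left. The helper names (`build_vocab`, `encode`, `one_hot_targets`) and the special tokens are hypothetical and chosen only for illustration.

```python
# Hypothetical special tokens; index 0 is reserved for padding.
PAD, START, END = "<pad>", "<start>", "<end>"


def build_vocab(sentences):
    """Map every distinct word to an integer index."""
    vocab = {PAD: 0, START: 1, END: 2}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab, max_len):
    """Turn a sentence into a fixed-length list of word indices."""
    ids = [vocab[START]] + [vocab[w] for w in sentence.split()] + [vocab[END]]
    if len(ids) > max_len:
        raise ValueError("sentence is longer than max_len")
    return ids + [vocab[PAD]] * (max_len - len(ids))


def one_hot_targets(decoder_ids, num_tokens):
    """Shift the decoder input one step to the left and one-hot encode it."""
    shifted = decoder_ids[1:] + [0]  # pad the final position
    return [[1.0 if k == idx else 0.0 for k in range(num_tokens)]
            for idx in shifted]


target_texts = ["the cat sat", "the dog"]
vocab = build_vocab(target_texts)
max_len = max(len(t.split()) for t in target_texts) + 2  # room for start/end

# Integer sequences, suitable as input to an Embedding layer.
decoder_input_data = [encode(t, vocab, max_len) for t in target_texts]
# One-hot targets, one vector of size len(vocab) per timestep.
decoder_target_data = [one_hot_targets(ids, len(vocab))
                       for ids in decoder_input_data]
```

Before calling `model.fit`, the nested lists would typically be converted to arrays, for example with `numpy.asarray`. If one-hot targets become too large for a realistic vocabulary, Keras also provides the `sparse_categorical_crossentropy` loss, which accepts integer targets directly.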

Glossary

Persian-to-English Glossary

Persian term		English equivalent
انفجار گرادیان		Exploding Gradient
بانظارت		Supervised
پردازش زبان طبیعی		Natural Language Processing (NLP)
پس‌انتشار		Backpropagation
تابع بیشینه هموار		Softmax Function
تأخیر زمانی کمینه		Minimal Time Lag
ترجمه ماشینی		Machine Translation (MT)
ترجمه ماشینی آماری		Statistical Machine Translation (SMT)
ترجمه ماشینی عصبی		Neural Machine Translation (NMT)
تشخیص گفتار		Speech Recognition
توالی		Sequence
جست‌وجوی پرتوی محلی		Beam Search
حافظه کوتاه مدت بلند		Long Short-Term Memory (LSTM)
دسته		Batch
دوره		Epoch
سرگشتگی		Perplexity
شبکه عصبی پیچشی		Convolutional Neural Network (CNN)
شبکه عصبی رو به جلو ژرف		Deep Feed-forward Neural Network
شبکه عصبی ژرف		Deep Neural Network (DNN)
شبکه عصبی مکرر		Recurrent Neural Network (RNN)
فرضیه جزئی		Partial Hypothesis
کدگذار		Encoder
کدگشا		Decoder
گذر جلو		Forward Pass
مدل زبانی		Language Model (LM)
مدل زبانی عصبی		Neural Language Model (NLM)
میرایی گرادیان		Vanishing Gradient
نشانه‌گذاری شده		Tokenized

Footnotes

deep neural networks ↩
backpropagation ↩
supervised ↩
natural language processing ↩
sequence ↩
deep feed-forward neural networks ↩
recurrent neural networks ↩
convolutional neural networks ↩
grid ↩
machine translation ↩
speech recognition ↩
long short-term memory ↩
neural machine translation ↩
statistical machine translation ↩
language model ↩
neural language models ↩
forward pass ↩
rectified linear unit ↩
sigmoid ↩
softmax function ↩
n-best list ↩
topic model ↩
batch ↩
epoch ↩
vanishing gradient ↩
exploding gradient ↩
inference ↩
tokenized ↩
beam search ↩
partial hypothesis ↩
perplexity ↩
minimal time lag ↩
There are several ways of computing the BLEU score, each defined by a Perl script; in this article, these existing scripts were used to compute the BLEU score. ↩

Computer science at the heart of civilization

2019-02-22T12:30:00+03:30

These projects have facilitated a modern approach to interdisciplinary problem-solving known as computational thinking. The most fascinating aspect of this work is how solving a problem in one domain often provides insights to address challenges in entirely different domains. The focus of this effort has been the generalization of solutions to make them applicable to a broad spectrum of problems.

Bridging Disciplines Through Computational Thinking

Over the past years, I have successfully integrated computer science into a variety of disciplines, showcasing the transformative potential of computational thinking. This interdisciplinary approach has yielded significant contributions in multiple fields: medicine (developing a non-invasive bladder monitoring system and jaundice prediction models), civil engineering (designing bridge management systems), materials engineering (applying artificial intelligence to inverse material design), railway engineering (testing an interlocking system), sociology (creating an intelligent mixed research framework), sports (developing a swimming competition management system), and molecular physics (utilizing Raman spectroscopy).

Each of these projects exemplifies the power of computational thinking in solving complex, real-world challenges. By abstracting and generalizing solutions, I have not only addressed specific problems within individual domains but also developed methods that can be applied to entirely different fields. This work demonstrates the universality of computational principles and their capacity to drive innovation across disciplines.

Expanding the Horizons of Computational Thinking

The interdisciplinary nature of computational thinking has opened new avenues for exploration. For instance:

- Healthcare and AI Integration: Leveraging machine learning for early disease detection, personalized treatments, and non-invasive diagnostic tools.
- Environmental Engineering: Employing computational models to analyze climate patterns, optimize renewable energy systems, and develop sustainable urban planning solutions.
- Education: Designing intelligent tutoring systems that adapt to individual student needs and incorporating computational thinking into curricula to foster problem-solving skills across all disciplines.
- Arts and Humanities: Utilizing computational methods to analyze literary texts, reconstruct historical artifacts, and create generative art.
- Ethics and Policy: Developing frameworks to ensure the ethical application of computational technologies and addressing societal implications of automation and AI.

Future Work in Interdisciplinary Computational Thinking

Moving forward, several promising areas warrant further research and development:

1. Adaptive Generalization of Solutions: Refining computational models to ensure their adaptability to diverse and evolving problems across disciplines.
2. Cross-Domain Knowledge Transfer: Investigating methodologies for systematically transferring insights and solutions between domains to foster more efficient and innovative problem-solving.
3. Human-Centric Design in Computational Tools: Focusing on usability and accessibility to ensure that computational solutions are approachable and effective for experts in non-technical fields.
4. Collaboration Platforms for Interdisciplinary Research: Building digital ecosystems that connect researchers from different fields, facilitating collaboration, knowledge sharing, and integrated solution development.
5. Addressing Socio-Technical Challenges: Exploring computational approaches to complex societal challenges, such as public health crises, environmental sustainability, and social equity.

Remark

The interdisciplinary application of computational thinking has not only enhanced my ability to solve problems in diverse fields but has also deepened my understanding of the interconnectedness of knowledge. Each endeavor has been an opportunity to transcend traditional boundaries, fostering innovation that benefits both specialized fields and society at large.

The evolving landscape of computational thinking continues to offer untapped potential for bridging gaps between disciplines. By embracing this modern approach, we can address increasingly complex challenges in a way that leverages the strengths of diverse fields and creates solutions that are both robust and adaptable.

Welcome

2019-02-22T12:30:00+03:30

"Life Is a Fractal Event (Exploration). Create unlimited (infinite) values in limited time!"

Morteza

Hello and welcome to my new personal web page and blog on GitHub. The blog is under construction, and some pages will be added in the future. Please refer to the About me page for more information.

A note about the blogging tool

I recently read about Pelican and decided to switch my blog from pure unstructured HTML to a structured static website. Pelican is a really beautiful blogging and publishing tool. Simply put, Pelican is a neat static site generator (SSG) written in Python. Like all SSGs, it enables super fast website generation. Pelican has lightweight documentation, a straightforward installation, and powerful features such as plugins and extensibility. I am new to Pelican, but it is simple and easy to use. I strongly recommend Pelican!