SIC-XE Assembler Implementation in JavaScript

Introduction

SIC-XE stands for Simplified Instructional Computer Extra Equipment, and is a hypothetical computer system used for teaching. It is presented primarily in the textbook System Software: An Introduction to Systems Programming, by Leland Beck. For this project, I was tasked with writing an assembler that translates a SIC-XE assembly listing into two outputs: a solution file containing the listing annotated with location counters and object code, and an object program that could hypothetically be loaded onto a SIC-XE machine.

I chose to write the implementation in Node.js, a JavaScript runtime built on Chrome's V8 JavaScript engine that runs outside of the browser. The biggest reason for choosing JavaScript was the ease of manipulating data structures compared to lower-level languages. Objects and arrays can be initialized, accessed, updated, and deleted with very little code, and memory is garbage collected, so there is no manual memory management to worry about. Implementing the assembler in a higher-level language keeps the focus on the design of the assembler rather than on the nuances of the chosen language.

Implementation Overview

The assembler has four phases: Pre-Processing, Pass 1, Pass 2, and Post-Processing. The Pre-Processing phase takes input from a text file and prepares it for Pass 1. Pass 1 builds the symbol table and the program blocks table, assigns a location to each applicable line, and expands macros. Pass 2 takes the output from Pass 1 and goes line by line, generating object code and the related object program records, such as text records and modification records. The last phase, Post-Processing, finalizes the object program from those records and then writes the solution and object program files to the output directory.

Pre-Processing

The Pre-Processing phase is short and simple. Its main goal is to split the tab-delimited, newline-separated assembly listing into an array that the assembler can iterate over. For this, I created a simple file management module that the assembler uses for both reading and writing files (Post-Processing also uses it). The assembler is given a file path and calls the file parser with it. The parser reads the file and immediately splits it on newlines, then iterates over the resulting array and splits each line on tabs. Each result is stored in a line object representing a single source line, which is pushed onto an array of lines. The line object includes the line number, symbol, operation, operand, and comment.

This array is now returned to the assembler for it to begin Pass 1.
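
As a rough sketch of this parsing step (not the project's actual code), the splitting could look like the following. The function name parseSource is hypothetical, and the source text is passed in as a string here rather than read from a file so that the example runs on its own.

```javascript
// Hypothetical helper: turns tab-delimited assembly source into line objects.
function parseSource(text) {
  return text
    .split(/\r?\n/)                        // one entry per source line
    .filter((raw) => raw.trim() !== "")    // ignore blank lines
    .map((raw, index) => {
      // Missing trailing fields default to empty strings.
      const [symbol = "", operation = "", operand = "", comment = ""] = raw.split("\t");
      // lineNumber counts non-blank lines only, since blanks were filtered out above.
      return { lineNumber: index + 1, symbol, operation, operand, comment };
    });
}

const lines = parseSource("COPY\tSTART\t0\nFIRST\tSTL\tRETADR\tsave return address\n");
console.log(lines[1].operation); // STL
```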

Pass 1

After Pre-Processing is completed, the assembler begins Pass 1. The main objectives of Pass 1 are to:

  • Create a symbol table
  • Create a program blocks table
  • Recognize different control sections
  • Handle literals & literal pools
  • Expand macros

Pass 1 begins by initializing the necessary data structures. These are the symbol table object, the program blocks array of objects, the literal array (LITTAB), and namTab, defTab, and argTab for macro-related operations. An important design decision was how to organize the symbol table as an object.

The symbol table object is organized as key-value pairs. Objects can be nested within the symbol table object, and this nesting is what handles the logic for multiple control sections: each control section gets its own nested object of symbols. The reason for this is so that symbols from one control section cannot overwrite symbols with the same name in another control section. It also helps in Pass 2 when accessing the correct symbol, which is discussed later on.

To access a symbol, the following syntax must be used:

SymbolTable[current control section][symbol name]

This will result in an object that looks like: { address: "000A", block: 0 }
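
A minimal sketch of this nested layout is shown below. The helper addSymbol and its duplicate check are assumptions added for illustration; the symbol names and addresses are made up.

```javascript
const symbolTable = {};

// Hypothetical helper: registers a symbol under its control section.
function addSymbol(table, section, name, address, block) {
  if (!table[section]) table[section] = {};      // first symbol in this section
  if (table[section][name]) {
    // Assumption: a duplicate within one section is treated as an error.
    throw new Error(`Duplicate symbol ${name} in ${section}`);
  }
  table[section][name] = { address, block };
}

addSymbol(symbolTable, "COPY", "BUFFER", "0033", 0);
addSymbol(symbolTable, "RDREC", "BUFFER", "000A", 0); // same name, different section

console.log(symbolTable["RDREC"]["BUFFER"]); // { address: '000A', block: 0 }
```

Because the control section is the outer key, the two BUFFER entries do not collide.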

The other data structures are simpler. The program blocks array contains a list of objects, each of which includes the block name, block number, block length, and starting address. The literal array stores the lines that are literals, and its entries follow the same object scheme as the line object described previously.

For macro-related operations, there are three data structures. defTab stores the macro definition as an array of the lines that make up the macro, and is accessed via defTab[macroname]. namTab stores the location counter at the point where the macro is first defined, in the form namTab[macroname]. argTab stores the arguments to pass to the macro as an array in which each element is one argument, in the form argTab[macroname].
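
To illustrate how the three tables could work together, here is a small sketch of definition and expansion. It is not the project's code: paramTab, defineMacro, and expandMacro are hypothetical names, and the plain text substitution of parameters is an assumption (it would misbehave if one parameter name were a prefix of another).

```javascript
const defTab = {}; // macro name -> array of body lines
const namTab = {}; // macro name -> location counter at the point of definition
const argTab = {}; // macro name -> arguments of the current invocation

// Assumption: parameter names live in a separate table for this sketch.
const paramTab = {};

function defineMacro(name, params, body, locctr) {
  defTab[name] = body;
  namTab[name] = locctr;
  paramTab[name] = params;
}

function expandMacro(name, args) {
  argTab[name] = args;
  return defTab[name].map((line) => {
    let operand = line.operand;
    // Replace each parameter with the argument in the same position.
    paramTab[name].forEach((param, i) => {
      operand = operand.split(param).join(argTab[name][i]);
    });
    return { ...line, operand };
  });
}

defineMacro("RDBUFF", ["&INDEV", "&BUFADR"], [
  { symbol: "", operation: "TD", operand: "=X'&INDEV'" },
  { symbol: "", operation: "STCH", operand: "&BUFADR,X" },
], "0000");

const expanded = expandMacro("RDBUFF", ["F1", "BUFFER"]);
console.log(expanded.map((l) => `${l.operation} ${l.operand}`));
```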

After the data structures are initialized, the assembler makes its first pass over the array of lines.

The main logic of Pass 1 is to run every element of the parsed lines array through a switch statement on its operation. This switch statement handles what should be done for each operation that requires an action.


Here is a brief overview of what happens during each operation:

  • START
    • Current control section is set to the name of the program.
    • Symbol table entry for this control section is initialized.
    • Default program blocks entry is initialized.
  • END
    • Literal array is iterated over and added to the end of the program.
    • The current program block length is set.
  • EXTREF
    • External references are added to the symbol table in a unique area dedicated to external references. Later used during the translation process.
  • EXTDEF
    • External definitions are added to the symbol table in a unique area dedicated to external definitions. Later used during the translation process.
  • CSECT
    • Changes the control section.
    • Adds all literals currently in the literal array to the end of the previous control section.
  • USE
    • Changes the current program block.
    • Either resumes a previous block, or starts a new one depending on the operand.
  • RESW/RESB/BYTE/WORD
    • Increments location counter by the related amount depending on the operation.
  • BASE
    • Does nothing except set the location counter for the current line to nothing, to indicate that this line does not generate object code.
  • LTORG
    • Iterates over the literal array and adds each literal starting at the current location.
  • EQU
    • Handles related EQU operations and arithmetic.
  • *
    • Does nothing
  • MACRO
    • Creates macro definition, fills out defTab, namTab, and argTab with respective info.
  • MEND
    • Does nothing as the previous MACRO operation logic runs until it sees MEND.
  • DEFAULT
    • This is called if no other switch condition fits the operation.
    • This either expands the macro, if the macro is being called, or it simply increments the location counter if necessary.

After Pass 1 finishes iterating over the array, it finalizes the program block lengths and returns the program blocks array, the symbol table, and the lines with location counters to the assembler. The assembler then immediately begins Pass 2.
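
The location counter handling in the switch statement can be sketched roughly as follows. This is a simplified illustration, not the project's code: lineLength and the small FORMATS table are hypothetical, and only a handful of operations are covered.

```javascript
// Hypothetical subset of the operation table: mnemonic -> instruction format.
const FORMATS = { LDA: 3, STL: 3, COMPR: 2, CLEAR: 2, FIX: 1 };

// Returns how many bytes a line occupies so the location counter can advance.
function lineLength({ operation, operand }) {
  switch (operation) {
    case "RESW":
      return 3 * parseInt(operand, 10);   // words are 3 bytes
    case "RESB":
      return parseInt(operand, 10);
    case "WORD":
      return 3;
    case "BYTE": {
      const body = operand.slice(2, -1);  // text between the quotes
      // X'..' packs two hex digits per byte; C'..' uses one byte per character.
      return operand[0] === "X" ? Math.ceil(body.length / 2) : body.length;
    }
    case "BASE":
      return 0;                           // directive, no object code
    default:
      if (operation.startsWith("+")) return 4; // extended format
      return FORMATS[operation] ?? 0;
  }
}

let locctr = 0;
const program = [
  { operation: "STL", operand: "RETADR" },
  { operation: "+JSUB", operand: "RDREC" },
  { operation: "BYTE", operand: "C'EOF'" },
  { operation: "RESW", operand: "1" },
];
for (const line of program) {
  line.location = locctr.toString(16).toUpperCase().padStart(4, "0");
  locctr += lineLength(line);
}
console.log(program.map((l) => l.location)); // [ '0000', '0003', '0007', '000A' ]
```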

Pass 2

The goal of this pass is to generate the object code for each instruction and use it to build the object program. The object code for a SIC-XE instruction can be generated in four different formats, which vary in size, fields, and addressing modes. The addressing mode tells the machine running the program how to locate and operate on the operand. For example, with immediate addressing the value for the instruction is provided within the object code itself, as with the operand #100, where 100 is the value for the instruction. With other forms of addressing, the value needs to be fetched from another place in memory.

Because of these different addressing methods, generating object code is one of the most complicated parts of this assembler. Any small mistake can result in incorrect object code.
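
As an illustration of how the operand syntax maps to addressing flags, the sketch below derives the n, i, x, and e bits following the conventions in the textbook. The function name addressingFlags is hypothetical, and the b and p bits are left at 0 here because they depend on the displacement calculation.

```javascript
// Hypothetical helper: derives n, i, x, e from the source syntax alone.
function addressingFlags(operation, operand = "") {
  const flags = { n: 1, i: 1, x: 0, b: 0, p: 0, e: 0 }; // simple addressing by default
  if (operand.startsWith("#")) flags.n = 0;       // immediate: n=0, i=1
  else if (operand.startsWith("@")) flags.i = 0;  // indirect:  n=1, i=0
  if (operand.endsWith(",X")) flags.x = 1;        // indexed
  if (operation.startsWith("+")) flags.e = 1;     // extended (format 4)
  return flags;
}

console.log(addressingFlags("LDA", "#100"));
console.log(addressingFlags("+JSUB", "RDREC"));
```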

Pass 2 relies on a program state object created for this assembler. It holds the current state of the program and is shared by the many parts that make up Pass 2.

To begin Pass 2, the assembler invokes the generate object code function and passes it the parsed lines, which now include location counters, as well as the symbol table and program blocks. Because Pass 2 is spread out across multiple processing areas, we need a central place to store the current state of some key variables:

  • Current control section
  • Current block
  • Symbol Table
  • Program Block array
  • Base register

The program state is implemented in a very simple manner, with getters and setters for each important state member.
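
A program state along these lines could be sketched as a small class with getters and setters. The class name ProgramState and the lookup convenience method are assumptions for illustration, not the project's actual code.

```javascript
class ProgramState {
  #controlSection = null;
  #block = 0;
  #symbolTable = {};
  #programBlocks = [];
  #base = null;

  get controlSection() { return this.#controlSection; }
  set controlSection(name) { this.#controlSection = name; }

  get block() { return this.#block; }
  set block(number) { this.#block = number; }

  get symbolTable() { return this.#symbolTable; }
  set symbolTable(table) { this.#symbolTable = table; }

  get programBlocks() { return this.#programBlocks; }
  set programBlocks(blocks) { this.#programBlocks = blocks; }

  get base() { return this.#base; }
  set base(address) { this.#base = address; }

  // Hypothetical convenience: look a symbol up in the current control section.
  lookup(symbol) {
    return this.#symbolTable[this.#controlSection]?.[symbol];
  }
}

const state = new ProgramState();
state.symbolTable = { COPY: { LENGTH: { address: "0033", block: 0 } } };
state.controlSection = "COPY";
state.base = 0x33;
console.log(state.lookup("LENGTH").address); // 0033
```

Keeping these values in one object means a CSECT or USE line only has to update the state once, and every later lookup sees the change.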

When Pass 2 begins, it adds the symbol table and the program block array to the state. The next step is for the assembler to begin iterating over the parsed lines from Pass 1. The base logic is similar to that of Pass 1: each array element is run through a switch statement that handles every operation that has an applicable action.


Here is a brief overview of what happens during each operation:

  • START
    • Sets the program state for the current control section and the current program block.
  • BYTE/WORD
    • Generates the object code for a BYTE/WORD instruction if applicable.
  • BASE
    • Sets the base register in the program state.
  • CSECT
    • Sets the program state for the current control section and the current program block.
  • USE
    • Sets the current program block in the program state.
  • *
    • Generates the object code for the literal.
  • DEFAULT
    • The majority of the processing is spent here. This is the area where the object code for the majority of the instructions is generated.
    • First, the operation code for the respective operation is generated.
    • Next, the operand address and instruction format are computed, along with the r1/r2 register values if it is a format 2 instruction.
    • The flags are then computed from the operation, operand, operand address, format, and location counter. These flags are used to determine the addressing modes.
    • Finally, the object code is generated by invoking a function which serves as a router to handle the object code generation for each format.

This function routes the assembler, via a switch statement, to another function that generates the object code for the given format. In this implementation, format 4 (v4) is handled by the format 3 (v3) function, as the difference between the two is very small and the v3 function has special conditions that adapt it to format 4. This was done to avoid redundant code.
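
A router of this kind might look like the sketch below, with formats 3 and 4 sharing one function. All function names here are hypothetical, and the field layout follows the instruction formats described in the textbook.

```javascript
// Formats a number as upper-case hex, zero-padded to the given width.
const hex = (value, width) => value.toString(16).toUpperCase().padStart(width, "0");

function formatOne({ opcode }) {
  return hex(opcode, 2);
}

function formatTwo({ opcode, r1, r2 = 0 }) {
  return hex(opcode, 2) + hex(r1, 1) + hex(r2, 1);
}

// Shared by formats 3 and 4: only the width of the address field differs.
function formatThreeOrFour({ opcode, flags, address }) {
  const { n, i, x, b, p, e } = flags;
  const first = (opcode & 0xfc) | (n << 1) | i;     // 6-bit opcode + n, i
  const xbpe = (x << 3) | (b << 2) | (p << 1) | e;
  const width = e ? 5 : 3;                          // 20-bit vs 12-bit field
  const mask = e ? 0xfffff : 0xfff;                 // also wraps negatives
  return hex(first, 2) + hex(xbpe, 1) + hex(address & mask, width);
}

function generateObjectCode(format, parts) {
  switch (format) {
    case 1: return formatOne(parts);
    case 2: return formatTwo(parts);
    case 3:
    case 4: return formatThreeOrFour(parts);
    default: throw new Error(`Unknown format ${format}`);
  }
}

const simple = { n: 1, i: 1, x: 0, b: 0, p: 1, e: 0 };
console.log(generateObjectCode(3, { opcode: 0x14, flags: simple, address: 0x02d })); // 17202D
```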

The object code is generated by checking the flags provided and routing the assembler to the code that follows the rules those flags dictate. Because this logic is difficult to simplify, it is currently just an if-else-if chain that matches flags. In the future, this would be an area to rewrite and simplify for readability and robustness.

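
One part of that decision, choosing between PC-relative and base-relative addressing for a format 3 instruction, can be sketched as below. This follows the rule described in the textbook rather than the project's if-else-if chain, and resolveDisplacement is a hypothetical name.

```javascript
// Picks PC-relative first, then base-relative, for a format 3 instruction.
// target, locctr and base are numeric addresses; base may be null if unset.
function resolveDisplacement(target, locctr, base = null) {
  const pc = locctr + 3;                    // PC already points past this instruction
  const pcDisp = target - pc;
  if (pcDisp >= -2048 && pcDisp <= 2047) {
    return { b: 0, p: 1, disp: pcDisp & 0xfff }; // 12-bit two's complement
  }
  if (base !== null) {
    const baseDisp = target - base;
    if (baseDisp >= 0 && baseDisp <= 4095) {
      return { b: 1, p: 0, disp: baseDisp };
    }
  }
  throw new Error("Operand out of range for format 3; format 4 is needed");
}

console.log(resolveDisplacement(0x30, 0x0));          // PC-relative, disp 0x02D
console.log(resolveDisplacement(0x36, 0x104e, 0x33)); // base-relative, disp 3
```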

In an attempt to make the code more readable and less complex, many of the operations that convert the operation code, flags, and operand address from hex to binary, and then to object code, are split into separate functions. These functions are simple, and moving them out of the main control flow reduces the number of operations being done in one area of the code.

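
Conversion helpers of this kind could look like the following sketch. The names hexToBinary and binaryToHex are assumptions, and the worked example assembles LDB #LENGTH with a displacement of 02D.

```javascript
// Converts a hex string to a binary string padded to a fixed number of bits.
const hexToBinary = (hexString, bits) =>
  parseInt(hexString, 16).toString(2).padStart(bits, "0");

// Converts a binary string (length a multiple of 4) back to upper-case hex.
const binaryToHex = (binary) =>
  parseInt(binary, 2).toString(16).toUpperCase().padStart(binary.length / 4, "0");

// LDB has opcode 68; only its top six bits are kept in formats 3 and 4.
const opcodeBits = hexToBinary("68", 8).slice(0, 6);
const flagBits = "010010";               // n i x b p e: immediate, PC-relative
const dispBits = hexToBinary("02D", 12); // 12-bit displacement

console.log(binaryToHex(opcodeBits + flagBits + dispBits)); // 69202D
```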

After the object code is generated for a line, two functions are invoked: one for generating text records and one for generating modification records. Where applicable, the assembler automatically adds the text records and modification records to arrays of objects. The text record objects include the object code, the start address of the record, and the current length of the line. The modification record objects include the start address, the control section (if applicable), and the length of the object code for that line.

While the applicable lines are being added to the text records, if the current program context changes, for example when the control section changes, the object program up to that point is automatically generated and kept in a local variable. This variable is later retrieved by the assembler in the Post-Processing phase.
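
The record building can be sketched as follows, using the record layouts from the textbook (a text record holds at most 30 bytes of object code). The function names, the space-separated output, and the numeric location field are assumptions, and control section names in modification records are left out for brevity.

```javascript
const hex = (value, width) => value.toString(16).toUpperCase().padStart(width, "0");
const MAX_TEXT_BYTES = 30; // a text record carries at most 30 bytes

function buildTextRecords(lines) {
  const records = [];
  let current = null;
  for (const { location, objectCode } of lines) {
    if (!objectCode) { current = null; continue; } // e.g. RESW/RESB ends the record
    const bytes = objectCode.length / 2;
    const contiguous = current !== null && current.start + current.length === location;
    if (!contiguous || current.length + bytes > MAX_TEXT_BYTES) {
      current = { start: location, length: 0, code: "" };
      records.push(current);
    }
    current.code += objectCode;
    current.length += bytes;
  }
  return records.map((r) => `T ${hex(r.start, 6)} ${hex(r.length, 2)} ${r.code}`);
}

// Format 4: the 20-bit address field starts one byte in and spans 5 half-bytes.
const modificationRecord = (location) => `M ${hex(location + 1, 6)} 05`;

const records = buildTextRecords([
  { location: 0, objectCode: "17202D" },
  { location: 3, objectCode: "69202D" },
  { location: 6, objectCode: "4B101036" },
  { location: 10, objectCode: "" },       // RESW 1, no object code
  { location: 13, objectCode: "B410" },
]);
console.log(records.join("\n"));
console.log(modificationRecord(6)); // M 000007 05
```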

Post-Processing

The Post-Processing sequence is very simple. The assembler builds the final object program by combining each necessary part, and then writes the object program and solution files to an output directory.

The assembler completes Pass 2 with the array of lines from before now including object code, which can then be written to a solution file. The assembler then calls a function to build out the remaining object program.


The final step is to write the files. For the solution file, the assembler loops over each line in the array and writes it to a file, separating items in the objects by tabs and lines by new lines. The object program code is already separated by spaces and new lines, and is simply written to an object program file.
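
A minimal sketch of writing the solution file is shown below. The helper formatSolution, the file name solution.txt, and the column order are assumptions, and a temporary directory stands in for the real output directory.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: one tab-separated row per line object.
function formatSolution(lines) {
  return lines
    .map((l) => [l.location, l.symbol, l.operation, l.operand, l.objectCode].join("\t"))
    .join("\n");
}

const assembled = [
  { location: "0000", symbol: "FIRST", operation: "STL", operand: "RETADR", objectCode: "17202D" },
  { location: "0003", symbol: "", operation: "LDB", operand: "#LENGTH", objectCode: "69202D" },
];

// A temporary directory is used here in place of the assembler's output directory.
const outputDir = fs.mkdtempSync(path.join(os.tmpdir(), "sicxe-"));
const solutionPath = path.join(outputDir, "solution.txt");
fs.writeFileSync(solutionPath, formatSolution(assembled) + "\n");
console.log(`Wrote ${solutionPath}`);
```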

Once this is completed, the Post-Processing phase resets all global variables so that another file can be assembled.

Implementation Process

Implementing this assembler involved a great deal of difficult debugging. The majority of the time was spent debugging two areas: flag generation and object program generation.

Flag generation seems simple, but it gets quite complicated once all of the different addressing modes and the varying ways to generate object code are considered. I spent a lot of time re-reading sections 2.2 and 2.3 of the textbook, trying to find out why my object code was not correct. After hours of debugging, I resolved my bugs regarding flags and arrived at the current flag function. I actively worked towards making it simpler and more robust, but struggled a lot.

Another large hurdle was adding new features, such as macros, control sections, and program blocks, without breaking previous features. This became very messy very fast once all three features were implemented. As a result, I had to rewrite the design of the assembler around the current program state approach, which helped tremendously because it centralized the state of some important variables.