Aiming at enhancing our understanding of how the central processing unit (CPU) works, this project invites us to build a MIPS single-cycle processor and a MIPS pipelined processor using Verilog.

Full report [↓]

I. Objectives

  • To build up both single-cycle and pipelined datapaths and construct a simple version of a processor sufficient to implement a subset of the MIPS instruction set that includes:
    1. The memory-reference instructions load word (lw) and store word (sw)
    2. The arithmetic-logical instructions add, addi, sub, and, andi, or, and slt
    3. The jumping instructions branch equal (beq), branch not equal (bne), and jump (j)
  • To model and simulate the single-cycle and pipelined datapaths in Verilog HDL
  • To synthesize the results by using Xilinx synthesis tools

Introduction

Aiming at enhancing our understanding of how the central processing unit (CPU) works, this project invites us to build a MIPS single-cycle processor and a MIPS pipelined processor using Verilog.

The single-cycle implementation executes all instructions in one clock cycle. This means that no datapath resource can be used more than once per instruction, so any element needed more than once must be duplicated. Therefore, a memory for instructions is separated from one for data. Although some of the functional units will need to be duplicated, many of the elements can be shared by different instruction flows. In order to share a datapath element between two or more different instruction classes, multiple connections to the input of an element are allowed, and multiplexors and control signals are designed carefully to select among the multiple inputs.

However, the single-cycle implementation has an unsatisfactory performance due to its inefficiency; the execution time of each instruction is one clock cycle, which is determined by the longest possible path in the processor. Using the same hardware components, the pipelined implementation allows different functional units of a system to run concurrently. As a result, pipelining technique significantly reduces the execution time of a same program compared to the single-cycle implementation.

In order to prevent data hazards in the pipeline and minimize the delay created by stalls, a forwarding unit and a hazard detection unit are implemented. The forwarding technique retrieves the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory. The hazard detection unit operates during the ID stage so that it can insert the stall between the load and its use; in this case, stalling is inevitable. For control hazards that might occur during the operation of jumping instructions, we assume branch is not taken and stall if the assumption is not correct during the ID stage via the hazard detection unit. For the case when there is a load before a branch, unfortunately, as stated before, stalls are added so that the word is successfully written in the instruction memory.

II. Circuit Design

i. Single-cycle datapath

Our project follows the following single-cycle processor schematic from the textbook. This version of the MIPS single-cycle processor can execute the following instructions: add, sub, and, or, slt, lw, sw, beq, bne, addi, andi and j.

The model divides the machine into two main units: control and data paths. Each unit consists of various functional blocks, such as a 32-bit ALU, register file, sign extension logic, and five multiplexers to select the appropriate operand.

For the ALU control unit and the 32-bit ALU, our design follows six combinations of four control inputs in the textbook. The figure below shows how to set the ALU control bits according to the different function codes of the ALUOp control bit and the R type command.

ii. Pipelined datapath

Our project follows the following pipelined processor schematic from the textbook. This version of the MIPS single-cycle processor can execute the following instructions: add, sub, and, or, slt, lw, sw, beq, bne, addi, andi and j.

III. Design of Components

i. IF Stage

1. PC MUX

The PC multiplexor controls what value replaces the PC (PC+4, the branch destination address or the jump destination address). The Verilog code for the PC multiplexor is

module Mux_N_bit(in1,in2,out,select);
    parameter N = 32;
    input[N-1:0] in1,in2;
    input select;
    output[N-1:0] out;
    assign out = select?in2:in1;
endmodule

2. PC Register

The program counter is a 32-bit register that is written at the end of every clock cycle and thus does not need a write control signal. The Verilog code for PC register is

module PC (
     input               clk,
                         PCWrite,
     input       [31:0]  in,
     output reg  [31:0]  out
 );
 
     initial begin
         out = 32'b0;
     end
 
     always @ (posedge clk) begin
         if (PCWrite)
             out <= in;
     end
 
 endmodule

3. PC Adder 4

The PC adder is an ALU wired to always add its two 32-bit inputs and place the sum on its output. The Verilog code for the adder is

module adder(
     input [31:0] a,
     input [31:0] b,
     output [31:0] sum
     );
     reg [31:0] sum;
     always @(a or b)
         begin
             sum = a + b;
         end
 endmodule

4. Instruction Memory

The instruction memory only reads, we treat it as a combinational logic: the output at any time reflects the contents of the location specified by the address input, and no read control signal is needed. The Verilog code for the instruction memory is

module instruction_memory (
      input       [31:0]  address,
      output      [31:0]  instruction
  );
  
      parameter size = 128; // you can change here, size is the max size of memory
      integer i;
      // initialize memory
      reg [31:0] memory [0:size-1];
      // clear all memory to nop
      initial begin
          for (i = 0; i < size; i = i + 1)
              memory[i] = 32'b0;
          // include the instruction_memory
          `include "InstructionMem_for_P2_Demo.txt"
      end
      // Output the menmory at address
      assign instruction = memory[address >> 2];
  
  endmodule

ii. EX Stage

1. Forwarding Unit and MUX

The forwarding unit detects the hazards in the EX and MEM stage and assigns the ALU forwarding control of the multiplexors. According to the conditions for the two kinds of data hazards, the Verilog code for the forwarding unit module is written as follow.

module Forward (
      input       [4:0]   registerRsID,
                          registerRtID,
                          registerRsEX,
                          registerRtEX,
                          registerRdMEM,
                          registerRdWB,
      input               regWriteMEM,
                          regWriteWB,
      output reg  [1:0]   forwardA,
                          forwardB,
      output reg          forwardC,
                          forwardD
  );
  
      initial begin
          forwardA = 2'b00;
          forwardB = 2'b00;
          forwardC = 1'b0;
          forwardD = 1'b0;
      end
  
      always @ ( * ) begin
          if (regWriteMEM && registerRdMEM && registerRdMEM == registerRsEX)
              forwardA = 2'b10;
          else if (regWriteWB && registerRdWB && registerRdWB == registerRsEX)
              forwardA = 2'b01;
          else
              forwardA = 2'b00;
  
          if (regWriteMEM && registerRdMEM && registerRdMEM == registerRtEX)
              forwardB = 2'b10;
          else if (regWriteWB && registerRdWB && registerRdWB == registerRtEX)
              forwardB = 2'b01;
          else
              forwardB = 2'b00;
  
          if (regWriteMEM && registerRdMEM && registerRdMEM == registerRsID)
              forwardC = 1'b1;
          else
              forwardC = 1'b0;
  
          if (regWriteMEM && registerRdMEM && registerRdMEM == registerRtID)
              forwardD = 1'b1;
          else
              forwardD = 1'b0;
      end
  
  endmodule // Forward

As the pipeline registers hold the data to be forwarding, we take inputs to the ALU from any pipeline register rather than jus ID/EX, so that the proper data can be forwarded to the next stage. By adding multiplexors to the input of the ALU, and with the proper controls determined by the forwarding unit showed in the following figure, the pipeline can run at full speed in the presence of the data dependences.

image-20181123093904906

The Verilog code for each 3-to-1 Mux is

module Mux_32bit_3to1 (in00, in01, in10, mux_out, control);
    input [31:0] in00, in01, in10;
    output [31:0] mux_out;
    input [1:0] control;
    reg [31:0] mux_out;
    always @(in00 or in01 or in10 or control)
    begin
        case(control)
        2'b00:mux_out<=in00;
        2'b01:mux_out<=in01;
        2'b10:mux_out<=in10;
        default: mux_out<=in00;
        endcase
    end 
endmodule

2. ALU Control

The ALU control unit generates a 4-bit control input to the ALU with the function field of the instruction and a 2-bit control field ALUOp determined by the main control unit. Our design follows the that in the textbook that shows how the ALU control inputs are set based on the 2-bit ALUOp control and the 6-bit function code.

image-20181123093825866

The Verilog code for the ALU control module is

module ALU_control(
      funct,ALUOp,ALUCtrl
      );
      input [5:0] funct;
      input [1:0] ALUOp;
      output [3:0] ALUCtrl;
      reg [3:0] ALUCtrl;
      always @ (funct or ALUOp)
      begin
          case(ALUOp)
              2'b00: assign ALUCtrl = 4'b0010;
              2'b01: assign ALUCtrl = 4'b0110;
              2'b10:
                  begin
                      if (funct == 6'b100000)      assign ALUCtrl = 4'b0010;
                      else if (funct == 6'b100010) assign ALUCtrl = 4'b0110;
                      else if (funct == 6'b100100) assign ALUCtrl = 4'b0000;
                      else if (funct == 6'b100101) assign ALUCtrl = 4'b0001;
                      else if (funct == 6'b101010) assign ALUCtrl = 4'b0111;
                  end
              2'b11:
                  assign ALUCtrl = 4'b0000;            
              // default: assign ALUCtrl = 4'b1111;
          endcase
      end
  endmodule

3. ALU

Depending on the ALU control lines, the ALU performs one of the functions shown in the following table.

image-20181123093919177

For load word and store word instructions, ALU computes the memory address by addition. For the R-type instructions, ALU performs one of the five actions: and, or add, subtract, or set on less than. For the instruction beq, the ALU performs subtraction. The Verilog code for the ALU is

`ifndef MODULE_ALU
`define MODULE_ALU
`timescale 1ns / 1ps
module ALU(
    ALUCtrl,a,b,zero,ALU_result
    );
    input [3:0] ALUCtrl;
    input [31:0] a, b;
    output zero;
    output [31:0] ALU_result;
    reg zero;
    reg [31:0] ALU_result;
    always @ (a or b or ALUCtrl)
    begin
        case (ALUCtrl)
            4'b0000:
            begin
                assign ALU_result = a & b;
                assign zero = (a & b == 0) ? 1:0;
            end
            4'b0001:
            begin
                assign ALU_result = a | b;
                assign zero = (a | b == 0) ? 1:0;
            end
            4'b0010:
            begin
                assign ALU_result = a + b;
                assign zero = (a + b == 0) ? 1:0;
            end
            4'b0110:
            begin
                assign ALU_result = a - b;
                assign zero = ( a == b) ? 1:0;
            end
            4'b0111:
            begin
                assign ALU_result = (a < b) ? 1:0;
                assign zero = (a < b) ? 0:1;
            end
            default:
            begin
                assign ALU_result = a;
                assign zero = (a == 0) ? 1:0;
            end
        endcase    
    end
endmodule
`endif

iii. Memory Stage and Write Back Stage

1. Data Memory

This part is used to save the data to a data memory. The data memory must be written on store instructions; hence, data memory has read and write control signals, an address input, and an input for the data to be written into memory.

module data_memory (
    input            clk,
    input            MemRead,
                 	 MemWrite,
    input    [31:0]  address,
                     write_data,
    output   [31:0]  read_data
);
    parameter        size = 64;	// size of data register_memory
    integer          i;
    wire     [31:0]  index;
    reg      [31:0]  register_memory [0:size-1];
    assign index = address >> 2; // address/4 
    initial begin
        for (i = 0; i < size; i = i + 1)
            register_memory[i] = 32'b0;
        // read_data = 32'b0;
        // wire can not be set within a initial.
    end
    always @ ( posedge clk ) begin
        if (MemWrite == 1'b1) begin
            register_memory[index] = write_data;
        end
    end
    assign read_data = (MemRead == 1'b1)?register_memory[index]:32'b0;
endmodule

VI. Control and Data Hazard

i. EXstage

Data Hazard

Branch prediction and forwarding help make a computer fast while still getting the right answers. This forwarding control will be in the EX stage, because the ALU forwarding multiplexors are found in that stage. Thus, we must pass the operand register numbers from the ID stage via the ID/EX pipeline register to determine whether to forward values.

if (MEM/WB.RegWrite and (MEM/WB.RegisterRd0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) )
ForwardA = 01

if (MEM/WB.RegWrite and (MEM/WB.RegisterRd0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) )
ForwardB = 01

According to the figure belowe and the logic above, we can get the verilog program belowe.

module Forwarding(
    // Input Register
    input   [4:0]   ID_EX_Reg_Rt,
                    ID_EX_Reg_Rs,
                    IF_ID_Reg_Rs,
                    IF_ID_Reg_Rt,
                    EX_MEM_Reg_Rd,
                    MEM_WB_Reg_Rd,
    // Input control signal
    input           EX_MEM_RegWrite,
                    MEM_WB_RegWrite,
    //  Forward for ALU input
    output  reg [1:0]   Forward_ALU_A,
                        Forward_ALU_B,
    //  Forward for state reg input
    output  reg     Forward_C,
                    Forward_D
);
    always @(*)
        begin
            if(EX_MEM_RegWrite && (EX_MEM_Reg_Rd != 0) && (EX_MEM_Reg_Rd == ID_EX_Reg_Rs))
                Forward_ALU_A = 2'b10;
            // ID_EX for ALU
            else if(MEM_WB_RegWrite && (MEM_WB_Reg_Rd != 0) && (MEM_WB_Reg_Rd == ID_EX_Reg_Rs))
                Forward_ALU_A = 2'b01;
            // MEM_WB for ALU
            else
                Forward_ALU_A = 2'b00;
            // Regsiter for ALU

            if(EX_MEM_RegWrite && (EX_MEM_Reg_Rd != 0) && (EX_MEM_Reg_Rd == ID_EX_Reg_Rt))
                Forward_ALU_B = 2'b10;
            else if(MEM_WB_RegWrite && (MEM_WB_Reg_Rd != 0) && (MEM_WB_Reg_Rd == ID_EX_Reg_Rt))
                Forward_ALU_B = 2'b01;
            else
                Forward_ALU_B = 2'b00;
            //Select the data input of ID_EX_State_reg
            if(EX_MEM_RegWrite && (MEM_WB_Reg_Rd != 0) && (MEM_WB_Reg_Rd == IF_ID_Reg_Rs))
                Forward_C = 1;
            else
                Forward_C = 0;
                
            if(EX_MEM_RegWrite && (MEM_WB_Reg_Rd != 0) && (MEM_WB_Reg_Rd == IF_ID_Reg_Rt))
                Forward_D = 1;
            else
                Forward_D = 0;
        end
endmodule

ii. ID stage

Data Hazard

We use the Hazard Detection Unit to detect the Load-Use Hazard and stall the pipline by one clock cycle.

ID/EX.MemRead and
((ID/EX.RegisterRt == IF/ID.RegisterRs) or
(ID/EX.RegisterRt == IF/ID.RegisterRt))

With the logic shown above, we can get the verilog program belowe.

module hazard_detection(
    input [4:0] ID_Reg_Rt, 
                IF_Reg_Rs, 
                IF_Reg_Rt,

    input       EX_MemRead,
                EX_regWirte,
                MEM_MemRead,
                Ins_Beq,
                Ins_Bne,
    
    output  reg     stall,
                    flush
);
    initial begin
        stall = 1'b0;
        flush = 1'b0;
    end
    always @(*) begin
        if(EX_MemRead&& ID_Reg_Rt&& ((ID_Reg_Rt == IF_Reg_Rs) || (ID_Reg_Rt == IF_Reg_Rt)))
        begin
            stall = 1'b1;
            flush = 1'b1;
        end else if(Ins_Beq || Ins_Bne) begin
            if(EX_regWirte && ID_Reg_Rt&&((ID_Reg_Rt == IF_Reg_Rs) || (ID_Reg_Rt == IF_Reg_Rt))) begin
                stall = 1'b1;
                flush = 1'b1;
            end else if (MEM_MemRead &&ID_Reg_Rt&& ((ID_Reg_Rt == IF_Reg_Rs) || (ID_Reg_Rt == IF_Reg_Rt))) begin
                stall = 1'b1;
                flush = 1'b1;
            end else begin
                stall = 1'b1;
                flush = 1'b1;                
            end
        end else begin
            stall = 1'b0;
            flush = 1'b0;
        end
    end
endmodule
Control Hazard

We have limited our concern to hazards involving arithmetic operations and data transfers. However, there are also pipeline hazards involving branches. An instruction must be fetched at every clock cycle to sustain the pipeline, yet in our design the decision about whether to branch doesn’t occur until the MEM pipeline stage. This delay in determining the proper instruction to fetch is called a control hazard or branch hazard, in contrast to the data hazards we have just examined.

We assume branch not taken and use dynamic branch prediction. We only change prediction on two successive mispredictions.

module HazardDetection (
    input               branchEqID,
                        branchNeID,
                        memReadEX,
                        regWriteEX,
                        memReadMEM,
    input       [4:0]   registerRsID,
                        registerRtID,
                        registerRtEX,
                        registerRdEX,
                        registerRdMEM,
    output reg          stall,
                        flush
);

    initial begin
        stall = 1'b0;
        flush = 1'b0;
    end

    always @ ( * ) begin
        if (memReadEX && registerRtEX && (registerRtEX == registerRsID || registerRtEX == registerRtID)) begin
            stall = 1'b1;
            flush = 1'b1;
        end else if (branchEqID || branchNeID) begin
            if (regWriteEX && registerRdEX && (registerRdEX == registerRsID || registerRdEX == registerRtID)) begin
                stall = 1'b1;
                flush = 1'b1;
            end else if (memReadMEM && registerRdMEM && (registerRdMEM == registerRsID || registerRdMEM == registerRtID)) begin
                stall = 1'b1;
                flush = 1'b1;
            end else begin
                stall = 1'b0;
                flush = 1'b0;
            end
        end else begin
            stall = 1'b0;
            flush = 1'b0;
        end
    end

endmodule

VII. Instruction Implementation

i. Data tables

The following table shows the setting of the control lines for each instruction in our design.

Jump ALUSrc RegDst ALUOp MemWrite MemRead Branch MemtoReg RegWrite
R-type 0 0 1 10 0 0 0 0 1
addi 0 1 0 00 0 0 0 0 1
andi 0 1 0 00 0 0 0 0 1
slt 0 1 0 00 0 0 0 0 1
beq 0 0 0 01 0 0 1 0 0
bne 0 0 0 01 0 0 1 0 0
lw 0 1 0 00 0 1 0 1 1
sw 0 1 0 00 1 0 0 0 0
j 1 0 0 00 0 0 0 0 0

VIII. SSD and Top Module

The Basys 3 board contains a four-digit common anode SSD. The anodes of the seven LEDs forming each
digit are tied together into one “common anode” circuit node, but the LED cathodes remain separate. The
common anode signals are available as four “digit enable” input signals to the 4-digit display. The cathodes of
similar segments on all four displays are connected into seven circuit nodes labeled CA through CG (so, for
example, the four d cathodes from the four digits are grouped together into a single circuit node called “CD”).
These seven cathode signals are available as inputs to the 4-digit display. This signal connection scheme makes
the cathode signals common to all digits but they can only illuminate the segments of a digit whose
corresponding anode signal is asserted.

module SSD(Q,out);
    input [3:0] Q;
    output [6:0] out;
    reg [6:0] out;
    always @(Q) begin
        case(Q)
        4'b0000: out = 7'b0000001;
        4'b0001: out = 7'b1001111;
        4'b0010: out = 7'b0010010;
        4'b0011: out = 7'b0000110;
        4'b0100: out = 7'b1001100;
        4'b0101: out = 7'b0100100;
        4'b0110: out = 7'b0100000;
        4'b0111: out = 7'b0001111;
        4'b1000: out = 7'b0000000;
        4'b1001: out = 7'b0000100;
        default: out = 7'b1111110; 
    endcase
end
endmodule

IX. RTL Schematic (Optional)

Please check the RTL schematic on the next two pages.

X. Textual Result

By running the instruction memory, we can achieve appendix part.

XI. Conclusion and Discussion

In this project, a MIPS single-cycle processor and a MIPS pipelined processor are modeled, implemented using Verilog and simulated in Vivado. The pipelined processor program is run on a Xilinx FPGA board, which is portal to the Vivado Design Suite. The results on board coincide with the simulation results.

Throughout the implementation process, some unexpected events occurred and time were spent on debugging the modules. The events can be divided into two categories. The following discusses the reasons that led to the events.

  1. Simulation errors in Verilog HDL
    • For the implementation of the and instruction in the ALU module, logical AND ‘&&’ should be used rather than bitwise AND ‘&’. Using bitwise AND will lead to problematic results. They are two different logical operations.
    • In the single cycle processor, due to the rising edge in the first clock cycle, if the first instruction is ‘lw’, error would occur. Therefore, the program counter starts from ‘-4’ instead of ‘0’.
    • When coding the main processor module, where a large number of wires are connected to run the program, the naming of wires was not paid attention to. This led to error in wire connections. The reason is that Verilog is a case-sensitive language. For example, WireReg and wireReg actually refer to different wires, resulting in the error.
    • The naming of the pipeline register wasn’t consistent for different components, some represented the output of a pipeline and some were the input of a pipeline. It led to some errors when simulating the processor too.
    • Besides, during the debug process, we learned that the assign statement used for continuous assignment in Verilog can only drive a value to a net but it cannot assign a value to a register.
  2. Unexpected results on FPGA board
    • Because the events in Verilog mostly happen at ‘posedge Clock’ or ‘negedge Clock’, when the switch spends more than half a clock cycle time in between the ‘on’ and ‘off’ position, the program counter will not be determined and hence leads to varying random numbers.

By accomplishing the project, our understanding of how the central processing unit (CPU) works is deepened through practice. The mechanisms of forwarding and flushing in order to prevent hazards are carefully comprehended during the design process. While theoretical knowledge is the foundation, as programmers, the use of programming language has a series of rules and conventions to follow. Especially for naming, the habit of consistency is important and will save a lot of time in the debug process, no matter what programming project we work on in the future.

X. Reference

Hennessy, & JohnL. (1998). Computer organization and design:the hardware/software interface. Morgan Kaufmann Publishers.

Patterson, D. A., & Hennessy, J. L. (2014). Computer organization and design, fourth edition. Tailieu Vn.