program
Digital Design Using Verilog to Implement Single-Cycle & Pipelined 32-Bit Processors
- 2018-11-23
- 胡炳城, Bingcheng
Aiming at enhancing our understanding of how the central processing unit (CPU) works, this project invites us to build a MIPS single-cycle processor and a MIPS pipelined processor using Verilog.
Full report [↓]
I. Objectives
- To build up both single-cycle and pipelined datapaths and construct a simple version of a processor sufficient to implement a subset of the MIPS instruction set that includes:
- The memory-reference instructions load word (
lw
) and store word (sw
) - The arithmetic-logical instructions
add, addi, sub, and, andi, or
, andslt
- The jumping instructions branch equal (
beq
), branch not equal (bne
), and jump (j
)
- The memory-reference instructions load word (
- To model and simulate the single-cycle and pipelined datapaths in Verilog HDL
- To synthesize the results by using Xilinx synthesis tools
Introduction
Aiming at enhancing our understanding of how the central processing unit (CPU) works, this project invites us to build a MIPS single-cycle processor and a MIPS pipelined processor using Verilog.
The single-cycle implementation executes all instructions in one clock cycle. This means that no datapath resource can be used more than once per instruction, so any element needed more than once must be duplicated. Therefore, a memory for instructions is separated from one for data. Although some of the functional units will need to be duplicated, many of the elements can be shared by different instruction flows. In order to share a datapath element between two or more different instruction classes, multiple connections to the input of an element are allowed, and multiplexors and control signals are designed carefully to select among the multiple inputs.
However, the single-cycle implementation has an unsatisfactory performance due to its inefficiency; the execution time of each instruction is one clock cycle, which is determined by the longest possible path in the processor. Using the same hardware components, the pipelined implementation allows different functional units of a system to run concurrently. As a result, pipelining technique significantly reduces the execution time of a same program compared to the single-cycle implementation.
In order to prevent data hazards in the pipeline and minimize the delay created by stalls, a forwarding unit and a hazard detection unit are implemented. The forwarding technique retrieves the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory. The hazard detection unit operates during the ID stage so that it can insert the stall between the load and its use; in this case, stalling is inevitable. For control hazards that might occur during the operation of jumping instructions, we assume branch is not taken and stall if the assumption is not correct during the ID stage via the hazard detection unit. For the case when there is a load before a branch, unfortunately, as stated before, stalls are added so that the word is successfully written in the instruction memory.
II. Circuit Design
i. Single-cycle datapath
Our project follows the following single-cycle processor schematic from the textbook. This version of the MIPS single-cycle processor can execute the following instructions: add, sub, and, or, slt, lw, sw, beq, bne, addi, andi and j.
The model divides the machine into two main units: control and data paths. Each unit consists of various functional blocks, such as a 32-bit ALU, register file, sign extension logic, and five multiplexers to select the appropriate operand.
For the ALU control unit and the 32-bit ALU, our design follows six combinations of four control inputs in the textbook. The figure below shows how to set the ALU control bits according to the different function codes of the ALUOp control bit and the R type command.
ii. Pipelined datapath
Our project follows the following pipelined processor schematic from the textbook. This version of the MIPS single-cycle processor can execute the following instructions: add, sub, and, or, slt, lw, sw, beq, bne, addi, andi and j.
III. Design of Components
i. IF Stage
1. PC MUX
The PC multiplexor controls what value replaces the PC (PC+4, the branch destination address or the jump destination address). The Verilog code for the PC multiplexor is
module Mux_N_bit(in1,in2,out,select);
parameter N = 32;
input[N-1:0] in1,in2;
input select;
output[N-1:0] out;
assign out = select?in2:in1;
endmodule
2. PC Register
The program counter is a 32-bit register that is written at the end of every clock cycle and thus does not need a write control signal. The Verilog code for PC register is
module PC (
input clk,
PCWrite,
input [31:0] in,
output reg [31:0] out
);
initial begin
out = 32'b0;
end
always @ (posedge clk) begin
if (PCWrite)
out <= in;
end
endmodule
3. PC Adder 4
The PC adder is an ALU wired to always add its two 32-bit inputs and place the sum on its output. The Verilog code for the adder is
module adder(
input [31:0] a,
input [31:0] b,
output [31:0] sum
);
reg [31:0] sum;
always @(a or b)
begin
sum = a + b;
end
endmodule
4. Instruction Memory
The instruction memory only reads, we treat it as a combinational logic: the output at any time reflects the contents of the location specified by the address input, and no read control signal is needed. The Verilog code for the instruction memory is
module instruction_memory (
input [31:0] address,
output [31:0] instruction
);
parameter size = 128; // you can change here, size is the max size of memory
integer i;
// initialize memory
reg [31:0] memory [0:size-1];
// clear all memory to nop
initial begin
for (i = 0; i < size; i = i + 1)
memory[i] = 32'b0;
// include the instruction_memory
`include "InstructionMem_for_P2_Demo.txt"
end
// Output the menmory at address
assign instruction = memory[address >> 2];
endmodule
ii. EX Stage
1. Forwarding Unit and MUX
The forwarding unit detects the hazards in the EX and MEM stage and assigns the ALU forwarding control of the multiplexors. According to the conditions for the two kinds of data hazards, the Verilog code for the forwarding unit module is written as follow.
module Forward (
input [4:0] registerRsID,
registerRtID,
registerRsEX,
registerRtEX,
registerRdMEM,
registerRdWB,
input regWriteMEM,
regWriteWB,
output reg [1:0] forwardA,
forwardB,
output reg forwardC,
forwardD
);
initial begin
forwardA = 2'b00;
forwardB = 2'b00;
forwardC = 1'b0;
forwardD = 1'b0;
end
always @ ( * ) begin
if (regWriteMEM && registerRdMEM && registerRdMEM == registerRsEX)
forwardA = 2'b10;
else if (regWriteWB && registerRdWB && registerRdWB == registerRsEX)
forwardA = 2'b01;
else
forwardA = 2'b00;
if (regWriteMEM && registerRdMEM && registerRdMEM == registerRtEX)
forwardB = 2'b10;
else if (regWriteWB && registerRdWB && registerRdWB == registerRtEX)
forwardB = 2'b01;
else
forwardB = 2'b00;
if (regWriteMEM && registerRdMEM && registerRdMEM == registerRsID)
forwardC = 1'b1;
else
forwardC = 1'b0;
if (regWriteMEM && registerRdMEM && registerRdMEM == registerRtID)
forwardD = 1'b1;
else
forwardD = 1'b0;
end
endmodule // Forward
As the pipeline registers hold the data to be forwarding, we take inputs to the ALU from any pipeline register rather than jus ID/EX, so that the proper data can be forwarded to the next stage. By adding multiplexors to the input of the ALU, and with the proper controls determined by the forwarding unit showed in the following figure, the pipeline can run at full speed in the presence of the data dependences.
The Verilog code for each 3-to-1 Mux is
module Mux_32bit_3to1 (in00, in01, in10, mux_out, control);
input [31:0] in00, in01, in10;
output [31:0] mux_out;
input [1:0] control;
reg [31:0] mux_out;
always @(in00 or in01 or in10 or control)
begin
case(control)
2'b00:mux_out<=in00;
2'b01:mux_out<=in01;
2'b10:mux_out<=in10;
default: mux_out<=in00;
endcase
end
endmodule
2. ALU Control
The ALU control unit generates a 4-bit control input to the ALU with the function field of the instruction and a 2-bit control field ALUOp determined by the main control unit. Our design follows the that in the textbook that shows how the ALU control inputs are set based on the 2-bit ALUOp control and the 6-bit function code.
The Verilog code for the ALU control module is
module ALU_control(
funct,ALUOp,ALUCtrl
);
input [5:0] funct;
input [1:0] ALUOp;
output [3:0] ALUCtrl;
reg [3:0] ALUCtrl;
always @ (funct or ALUOp)
begin
case(ALUOp)
2'b00: assign ALUCtrl = 4'b0010;
2'b01: assign ALUCtrl = 4'b0110;
2'b10:
begin
if (funct == 6'b100000) assign ALUCtrl = 4'b0010;
else if (funct == 6'b100010) assign ALUCtrl = 4'b0110;
else if (funct == 6'b100100) assign ALUCtrl = 4'b0000;
else if (funct == 6'b100101) assign ALUCtrl = 4'b0001;
else if (funct == 6'b101010) assign ALUCtrl = 4'b0111;
end
2'b11:
assign ALUCtrl = 4'b0000;
// default: assign ALUCtrl = 4'b1111;
endcase
end
endmodule
3. ALU
Depending on the ALU control lines, the ALU performs one of the functions shown in the following table.
For load word and store word instructions, ALU computes the memory address by addition. For the R-type instructions, ALU performs one of the five actions: and, or add, subtract, or set on less than. For the instruction beq, the ALU performs subtraction. The Verilog code for the ALU is
`ifndef MODULE_ALU
`define MODULE_ALU
`timescale 1ns / 1ps
module ALU(
ALUCtrl,a,b,zero,ALU_result
);
input [3:0] ALUCtrl;
input [31:0] a, b;
output zero;
output [31:0] ALU_result;
reg zero;
reg [31:0] ALU_result;
always @ (a or b or ALUCtrl)
begin
case (ALUCtrl)
4'b0000:
begin
assign ALU_result = a & b;
assign zero = (a & b == 0) ? 1:0;
end
4'b0001:
begin
assign ALU_result = a | b;
assign zero = (a | b == 0) ? 1:0;
end
4'b0010:
begin
assign ALU_result = a + b;
assign zero = (a + b == 0) ? 1:0;
end
4'b0110:
begin
assign ALU_result = a - b;
assign zero = ( a == b) ? 1:0;
end
4'b0111:
begin
assign ALU_result = (a < b) ? 1:0;
assign zero = (a < b) ? 0:1;
end
default:
begin
assign ALU_result = a;
assign zero = (a == 0) ? 1:0;
end
endcase
end
endmodule
`endif
iii. Memory Stage and Write Back Stage
1. Data Memory
This part is used to save the data to a data memory. The data memory must be written on store instructions; hence, data memory has read and write control signals, an address input, and an input for the data to be written into memory.
module data_memory (
input clk,
input MemRead,
MemWrite,
input [31:0] address,
write_data,
output [31:0] read_data
);
parameter size = 64; // size of data register_memory
integer i;
wire [31:0] index;
reg [31:0] register_memory [0:size-1];
assign index = address >> 2; // address/4
initial begin
for (i = 0; i < size; i = i + 1)
register_memory[i] = 32'b0;
// read_data = 32'b0;
// wire can not be set within a initial.
end
always @ ( posedge clk ) begin
if (MemWrite == 1'b1) begin
register_memory[index] = write_data;
end
end
assign read_data = (MemRead == 1'b1)?register_memory[index]:32'b0;
endmodule
VI. Control and Data Hazard
i. EXstage
Data Hazard
Branch prediction and forwarding help make a computer fast while still getting the right answers. This forwarding control will be in the EX stage, because the ALU forwarding multiplexors are found in that stage. Thus, we must pass the operand register numbers from the ID stage via the ID/EX pipeline register to determine whether to forward values.
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) )
ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) )
ForwardB = 01
According to the figure belowe and the logic above, we can get the verilog program belowe.
module Forwarding(
// Input Register
input [4:0] ID_EX_Reg_Rt,
ID_EX_Reg_Rs,
IF_ID_Reg_Rs,
IF_ID_Reg_Rt,
EX_MEM_Reg_Rd,
MEM_WB_Reg_Rd,
// Input control signal
input EX_MEM_RegWrite,
MEM_WB_RegWrite,
// Forward for ALU input
output reg [1:0] Forward_ALU_A,
Forward_ALU_B,
// Forward for state reg input
output reg Forward_C,
Forward_D
);
always @(*)
begin
if(EX_MEM_RegWrite && (EX_MEM_Reg_Rd != 0) && (EX_MEM_Reg_Rd == ID_EX_Reg_Rs))
Forward_ALU_A = 2'b10;
// ID_EX for ALU
else if(MEM_WB_RegWrite && (MEM_WB_Reg_Rd != 0) && (MEM_WB_Reg_Rd == ID_EX_Reg_Rs))
Forward_ALU_A = 2'b01;
// MEM_WB for ALU
else
Forward_ALU_A = 2'b00;
// Regsiter for ALU
if(EX_MEM_RegWrite && (EX_MEM_Reg_Rd != 0) && (EX_MEM_Reg_Rd == ID_EX_Reg_Rt))
Forward_ALU_B = 2'b10;
else if(MEM_WB_RegWrite && (MEM_WB_Reg_Rd != 0) && (MEM_WB_Reg_Rd == ID_EX_Reg_Rt))
Forward_ALU_B = 2'b01;
else
Forward_ALU_B = 2'b00;
//Select the data input of ID_EX_State_reg
if(EX_MEM_RegWrite && (MEM_WB_Reg_Rd != 0) && (MEM_WB_Reg_Rd == IF_ID_Reg_Rs))
Forward_C = 1;
else
Forward_C = 0;
if(EX_MEM_RegWrite && (MEM_WB_Reg_Rd != 0) && (MEM_WB_Reg_Rd == IF_ID_Reg_Rt))
Forward_D = 1;
else
Forward_D = 0;
end
endmodule
ii. ID stage
Data Hazard
We use the Hazard Detection Unit
to detect the Load-Use Hazard and stall the pipline by one clock cycle.
ID/EX.MemRead and
((ID/EX.RegisterRt == IF/ID.RegisterRs) or
(ID/EX.RegisterRt == IF/ID.RegisterRt))
With the logic shown above, we can get the verilog program belowe.
module hazard_detection(
input [4:0] ID_Reg_Rt,
IF_Reg_Rs,
IF_Reg_Rt,
input EX_MemRead,
EX_regWirte,
MEM_MemRead,
Ins_Beq,
Ins_Bne,
output reg stall,
flush
);
initial begin
stall = 1'b0;
flush = 1'b0;
end
always @(*) begin
if(EX_MemRead&& ID_Reg_Rt&& ((ID_Reg_Rt == IF_Reg_Rs) || (ID_Reg_Rt == IF_Reg_Rt)))
begin
stall = 1'b1;
flush = 1'b1;
end else if(Ins_Beq || Ins_Bne) begin
if(EX_regWirte && ID_Reg_Rt&&((ID_Reg_Rt == IF_Reg_Rs) || (ID_Reg_Rt == IF_Reg_Rt))) begin
stall = 1'b1;
flush = 1'b1;
end else if (MEM_MemRead &&ID_Reg_Rt&& ((ID_Reg_Rt == IF_Reg_Rs) || (ID_Reg_Rt == IF_Reg_Rt))) begin
stall = 1'b1;
flush = 1'b1;
end else begin
stall = 1'b1;
flush = 1'b1;
end
end else begin
stall = 1'b0;
flush = 1'b0;
end
end
endmodule
Control Hazard
We have limited our concern to hazards involving arithmetic operations and data transfers. However, there are also pipeline hazards involving branches. An instruction must be fetched at every clock cycle to sustain the pipeline, yet in our design the decision about whether to branch doesn’t occur until the MEM pipeline stage. This delay in determining the proper instruction to fetch is called a control hazard or branch hazard, in contrast to the data hazards we have just examined.
We assume branch not taken and use dynamic branch prediction. We only change prediction on two successive mispredictions.
module HazardDetection (
input branchEqID,
branchNeID,
memReadEX,
regWriteEX,
memReadMEM,
input [4:0] registerRsID,
registerRtID,
registerRtEX,
registerRdEX,
registerRdMEM,
output reg stall,
flush
);
initial begin
stall = 1'b0;
flush = 1'b0;
end
always @ ( * ) begin
if (memReadEX && registerRtEX && (registerRtEX == registerRsID || registerRtEX == registerRtID)) begin
stall = 1'b1;
flush = 1'b1;
end else if (branchEqID || branchNeID) begin
if (regWriteEX && registerRdEX && (registerRdEX == registerRsID || registerRdEX == registerRtID)) begin
stall = 1'b1;
flush = 1'b1;
end else if (memReadMEM && registerRdMEM && (registerRdMEM == registerRsID || registerRdMEM == registerRtID)) begin
stall = 1'b1;
flush = 1'b1;
end else begin
stall = 1'b0;
flush = 1'b0;
end
end else begin
stall = 1'b0;
flush = 1'b0;
end
end
endmodule
VII. Instruction Implementation
i. Data tables
The following table shows the setting of the control lines for each instruction in our design.
Jump | ALUSrc | RegDst | ALUOp | MemWrite | MemRead | Branch | MemtoReg | RegWrite | |
---|---|---|---|---|---|---|---|---|---|
R-type | 0 | 0 | 1 | 10 | 0 | 0 | 0 | 0 | 1 |
addi | 0 | 1 | 0 | 00 | 0 | 0 | 0 | 0 | 1 |
andi | 0 | 1 | 0 | 00 | 0 | 0 | 0 | 0 | 1 |
slt | 0 | 1 | 0 | 00 | 0 | 0 | 0 | 0 | 1 |
beq | 0 | 0 | 0 | 01 | 0 | 0 | 1 | 0 | 0 |
bne | 0 | 0 | 0 | 01 | 0 | 0 | 1 | 0 | 0 |
lw | 0 | 1 | 0 | 00 | 0 | 1 | 0 | 1 | 1 |
sw | 0 | 1 | 0 | 00 | 1 | 0 | 0 | 0 | 0 |
j | 1 | 0 | 0 | 00 | 0 | 0 | 0 | 0 | 0 |
VIII. SSD and Top Module
The Basys 3 board contains a four-digit common anode SSD. The anodes of the seven LEDs forming each
digit are tied together into one “common anode” circuit node, but the LED cathodes remain separate. The
common anode signals are available as four “digit enable” input signals to the 4-digit display. The cathodes of
similar segments on all four displays are connected into seven circuit nodes labeled CA through CG (so, for
example, the four d cathodes from the four digits are grouped together into a single circuit node called “CD”).
These seven cathode signals are available as inputs to the 4-digit display. This signal connection scheme makes
the cathode signals common to all digits but they can only illuminate the segments of a digit whose
corresponding anode signal is asserted.
module SSD(Q,out);
input [3:0] Q;
output [6:0] out;
reg [6:0] out;
always @(Q) begin
case(Q)
4'b0000: out = 7'b0000001;
4'b0001: out = 7'b1001111;
4'b0010: out = 7'b0010010;
4'b0011: out = 7'b0000110;
4'b0100: out = 7'b1001100;
4'b0101: out = 7'b0100100;
4'b0110: out = 7'b0100000;
4'b0111: out = 7'b0001111;
4'b1000: out = 7'b0000000;
4'b1001: out = 7'b0000100;
default: out = 7'b1111110;
endcase
end
endmodule
IX. RTL Schematic (Optional)
Please check the RTL schematic on the next two pages.
X. Textual Result
By running the instruction memory, we can achieve appendix part.
XI. Conclusion and Discussion
In this project, a MIPS single-cycle processor and a MIPS pipelined processor are modeled, implemented using Verilog and simulated in Vivado. The pipelined processor program is run on a Xilinx FPGA board, which is portal to the Vivado Design Suite. The results on board coincide with the simulation results.
Throughout the implementation process, some unexpected events occurred and time were spent on debugging the modules. The events can be divided into two categories. The following discusses the reasons that led to the events.
- Simulation errors in Verilog HDL
- For the implementation of the
and
instruction in the ALU module, logical AND ‘&&’ should be used rather than bitwise AND ‘&’. Using bitwise AND will lead to problematic results. They are two different logical operations. - In the single cycle processor, due to the rising edge in the first clock cycle, if the first instruction is ‘lw’, error would occur. Therefore, the program counter starts from ‘-4’ instead of ‘0’.
- When coding the main processor module, where a large number of wires are connected to run the program, the naming of wires was not paid attention to. This led to error in wire connections. The reason is that Verilog is a case-sensitive language. For example,
WireReg
andwireReg
actually refer to different wires, resulting in the error. - The naming of the pipeline register wasn’t consistent for different components, some represented the output of a pipeline and some were the input of a pipeline. It led to some errors when simulating the processor too.
- Besides, during the debug process, we learned that the
assign
statement used for continuous assignment in Verilog can only drive a value to a net but it cannot assign a value to a register.
- For the implementation of the
- Unexpected results on FPGA board
- Because the events in Verilog mostly happen at ‘posedge Clock’ or ‘negedge Clock’, when the switch spends more than half a clock cycle time in between the ‘on’ and ‘off’ position, the program counter will not be determined and hence leads to varying random numbers.
By accomplishing the project, our understanding of how the central processing unit (CPU) works is deepened through practice. The mechanisms of forwarding and flushing in order to prevent hazards are carefully comprehended during the design process. While theoretical knowledge is the foundation, as programmers, the use of programming language has a series of rules and conventions to follow. Especially for naming, the habit of consistency is important and will save a lot of time in the debug process, no matter what programming project we work on in the future.
X. Reference
Hennessy, & JohnL. (1998). Computer organization and design:the hardware/software interface. Morgan Kaufmann Publishers.
Patterson, D. A., & Hennessy, J. L. (2014). Computer organization and design, fourth edition. Tailieu Vn.