if (clear_req) begin
  busy <= 1'b0;
  done <= 1'b0;
end else if (start_req && !busy) begin
  busy <= 1'b1;
  done <= 1'b0;
  relu_enable <= ctrl_write_data[1];
  k_idx <= 2'd0;
  clear_accumulators();
end
On the software side, polling can be written as:
int wait_done(unsigned timeout) {
  while (timeout--) {
    uint32_t status = read_reg(STATUS);
    if (status & STATUS_DONE) return 0;
  }
  return -1;
}
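This is a primitive but clear coprocessor protocol: software starts a job, then polls the status register until the done bit is set or the timeout runs out.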
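The handshake can be exercised without hardware by pointing the polling loop at a small software stand-in for the device. The sketch below is only an illustration, not the actual driver. The register indices, the `CTRL_START`, `CTRL_CLEAR`, `STATUS_DONE` and `STATUS_BUSY` bit positions, the mock `read_reg`/`write_reg`, and the helper `run_job` are all assumptions; only the ReLU flag at bit 1 of the control word comes from the RTL (`ctrl_write_data[1]`).

```c
#include <stdint.h>

/* Hypothetical register map and bit layout, for illustration only. */
enum { CTRL = 0, STATUS = 1 };
#define CTRL_START  (1u << 0)  /* assumed position */
#define CTRL_RELU   (1u << 1)  /* matches ctrl_write_data[1] in the RTL */
#define CTRL_CLEAR  (1u << 2)  /* assumed position */
#define STATUS_DONE (1u << 0)  /* assumed position */
#define STATUS_BUSY (1u << 1)  /* assumed position */

/* Mock device state mirroring the busy/done flags in the RTL. */
static struct { int busy, done, relu, cycles_left; } dev;

static void write_reg(int reg, uint32_t value)
{
    if (reg != CTRL) return;
    if (value & CTRL_CLEAR) {
        /* Clear has priority over start, as in the RTL. */
        dev.busy = 0;
        dev.done = 0;
    } else if ((value & CTRL_START) && !dev.busy) {
        dev.busy = 1;
        dev.done = 0;
        dev.relu = (value & CTRL_RELU) != 0;
        dev.cycles_left = 4;  /* one step per k index */
    }
}

static uint32_t read_reg(int reg)
{
    if (reg != STATUS) return 0;
    /* Each status read advances the mock by one clock, purely for the demo. */
    if (dev.busy && --dev.cycles_left == 0) {
        dev.busy = 0;
        dev.done = 1;
    }
    return (dev.done ? STATUS_DONE : 0) | (dev.busy ? STATUS_BUSY : 0);
}

static int wait_done(unsigned timeout)
{
    while (timeout--) {
        if (read_reg(STATUS) & STATUS_DONE) return 0;
    }
    return -1;
}

/* Start one job and wait; returns 0 on completion, -1 on timeout. */
static int run_job(int use_relu, unsigned timeout)
{
    write_reg(CTRL, CTRL_CLEAR);
    write_reg(CTRL, CTRL_START | (use_relu ? CTRL_RELU : 0));
    return wait_done(timeout);
}
```

With this mock, `run_job(1, 10)` completes on the fourth status read, while a timeout of 2 gives up with -1 before the device reports done. On real hardware the timeout would need to be sized against the bus latency rather than a read count.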
4x4 Multiply Core
Internally, the hardware core computes:
for k in 0..3:
  for row in 0..3:
    for col in 0..3:
      acc[row][col] += A[row][k] * B[k][col]
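if (busy) begin
  for (row = 0; row < 4; row++) begin
    for (col = 0; col < 4; col++) begin
      // One k slice per clock: add A[row][k_idx] * B[k_idx][col].
      acc[row][col] <= acc[row][col]
        + $signed(get_lane(a_regs[row], k_idx))
        * $signed(get_lane(b_regs[k_idx], col));

      // acc_next is not defined in this excerpt; it presumably holds the
      // accumulator value including this cycle's product, since acc itself
      // is updated with a non-blocking assignment.
      if (k_idx == 2'd3) begin
        c_regs[row][col] <= relu_enable ? relu32(acc_next[row][col]) : acc_next[row][col];
      end
    end
  end

  // After the last k slice, drop busy and raise done; otherwise advance k.
  if (k_idx == 2'd3) begin
    busy <= 1'b0;
    done <= 1'b1;
  end else begin
    k_idx <= k_idx + 1'b1;
  end
end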
The relu32() used here is simple:
function signed [31:0] relu32;
  input signed [31:0] value;
  relu32 = value[31] ? 32'sd0 : value;
endfunction
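To check the core's output from software, a plain C reference model that follows the same k-outer accumulation order and applies the same ReLU can be useful. This is a sketch under stated assumptions: the `int8_t` operand type is a guess (the excerpt only shows that `get_lane` results are treated as signed), the names `relu32_ref` and `matmul4x4_ref` are hypothetical, and the accumulators are assumed to start at zero, as `clear_accumulators()` suggests.

```c
#include <stdint.h>

/* Same rule as relu32 in the RTL: a set sign bit means the value is
 * negative, so it is replaced by 0. */
static int32_t relu32_ref(int32_t value)
{
    return value < 0 ? 0 : value;
}

/* C = A * B for 4x4 matrices, accumulating one k slice per outer step,
 * which is what the hardware does per clock. int8_t operands are an
 * assumption about the lane width. */
static void matmul4x4_ref(int8_t a[4][4], int8_t b[4][4],
                          int relu_enable, int32_t c[4][4])
{
    int32_t acc[4][4] = {{0}};

    for (int k = 0; k < 4; k++)
        for (int row = 0; row < 4; row++)
            for (int col = 0; col < 4; col++)
                acc[row][col] += (int32_t)a[row][k] * (int32_t)b[k][col];

    /* ReLU is applied once, to the final sums, as in the k_idx == 3 step. */
    for (int row = 0; row < 4; row++)
        for (int col = 0; col < 4; col++)
            c[row][col] = relu_enable ? relu32_ref(acc[row][col])
                                      : acc[row][col];
}
```

Comparing `c` against the values read back from the result registers, once with `relu_enable` set and once without, covers both output paths of the core.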