解析RISC-V A拓展--Eventual Success of Store-Conditional Instructions

John · 2024 年11 月 28 日 14:59

前言

最近在读RISCV SPEC A拓展时，发现有一个章节是讲Eventual Success of Store-Conditional Instructions，里面规定了一种constrained LR/SC loop，这种loop具有以下的限制，如下图所示。有同事把他理解为：“在违反这些限制的时候，应该清除reservation而使SC失败”，实则不然。为了方便大家能正确理解，对constrained LR/SC loop在这里进行解释。

• The loop comprises only an LR/SC sequence and code to retry the sequence in the case of failure, and must comprise at most 16 instructions placed sequentially in memory.
• An LR/SC sequence begins with an LR instruction and ends with an SC instruction. The dynamic code executed between the LR and SC instructions can only contain instructions from the base ‘‘I’’ instruction set, excluding loads, stores, backward jumps, taken backward branches, JALR, FENCE, and SYSTEM instructions. If the ‘‘C’’ extension is supported, then compressed forms of the aforementioned ‘‘I’’ instructions are also permitted.
• The code to retry a failing LR/SC sequence can contain backwards jumps and/or branches to repeat the LR/SC sequence, but otherwise has the same constraint as the code between the LR and SC.
• The LR and SC addresses must lie within a memory region with the LR/SC eventuality property. The execution environment is responsible for communicating which regions have this property.
• The SC must be to the same effective address and of the same data size as the latest LR executed by the same hart.

背景

LR/SC 指令在 ARM 架构中被广泛应用，与之功能相当的是 x86 的 compare-and-swap (CAS) 指令。RISCV在制定的时候，在面对怎么实现lock-free的数据结构的问题时，充分讨论了是选择LR还是选择CAS，但最终还是选择了LR/SC（原因在附录中说明）。

LR/SC往往是一个循环，LR先建立一个reservation，SC仅能在reservation有效时成功。而在SC失败时，后续程序往往会检测SC的回写值是否为非0，如果非0，跳回去将LR/SC序列重做一遍，这样的序列，我们称为LR/SC loop。

但是LR/SC指令有着一大缺陷，就是容易造成系统的活锁（livelock），其实这就违反了lock-free的原则，即可能没有任何一个hart能跳出这一循环。
举个例子，两个核都在执行LR/SC序列，但是如果两个核中的LR/SC序列中都包含store，可能会引起两个核都一直都陷在LR/SC循环中无法跳出。

解决措施

为了避免livelock情况的发生，RISCV定义了一种constrained LR/SC loop，只要软件使用该种constrained LR/SC loop，则基本能保证livelock问题不会一直持续，即总有某个核能跳出LR/SC循环，取得某种进展。一旦违反这些约束的话，可能会导致多核或单核的LR/SC loop陷入死循环，导致系统的挂死。

下面解释下每条约束的设置原因，即违反该条约束为什么有可能会导致LR/SC loop进入死循环。

限制1：

The loop comprises only an LR/SC sequence and code to retry the sequence in the case of failure, and must comprise at most 16 instructions placed sequentially in memory.

限制LR/SC loop仅能包含连续的16条指令，是为了使得LR/SC loop的程序能够完整加载进cache，然后屏蔽外界对cache的干扰，完整地执行完LR/SC loop。
举个极端的反例，在某个实现中，icache有2-way，如果指令数超过16，会导致每次只能将部分LR/SC循环中的部分指令加载进icache，数据和指令共享总线，且总线（AHB, AXI3）通过lock传输来实现原子访问，在总线上取指令将会使lock传输失效，导致每次SC都会失败，从而导致核一直无法跳出LR/SC循环。

事实上，16这个数字只是对软件和硬件的一个权衡：

数字越大，软件越方便编程，而硬件越难以将整段指令加载进cache。
数字越小，硬件越方便实现，越容易将整段指令加载进cache中，而软件越难编程

一般的硬件实现，只要有合适的cacheline大小以及set排布方式，都能满足将16条连续指令加载进cache中的要求。

限制2：

An LR/SC sequence begins with an LR instruction and ends with an SC
instruction. The dynamic code executed between the LR and SC
instructions can only contain instructions from the base ‘‘I’’
instruction set, excluding loads, stores, backward jumps, taken
backward branches, JALR, FENCE, and SYSTEM instructions. If the ‘‘C’’
extension is supported, then compressed forms of the aforementioned
‘‘I’’ instructions are also permitted.

LR和SC之间的代码仅能包含基础I拓展指令集中的指令，且不能有load, store。store很好理解，上文已经举了一个例子，如果包含store的话，可能会挂死。
但load其实在某些实现中，也会导致锁的失效，举个极端的反例，对于只有一个way的cache，load指令可能会导致含有reservation的那个cache line被踢出，从而导致锁的失效。

除此之外，禁止向后跳转指令是为了限制执行LR/SC的时间，而fence和系统指令也可能会导致含有reservation的失效。
需要注意的是：作者在这里特地强调了是dynamic code，是指在LR和成功的SC之间不能有向后跳转，而不是说LR和SC之间不能有向后跳转的指令

限制3：

The code to retry a failing LR/SC sequence can contain backwards jumps and/or branches to repeat the LR/SC sequence, but otherwise has the same constraint as the code between the LR and SC.

该条针对LR/SC loop中的去重做LR/SC的代码（在LR/SC序列之外）。当然，该重做代码可以包含可以包含向后跳指令，其他约束和LR/SC sequence中的代码约束相同。

限制4：

The LR and SC addresses must lie within a memory region with the LR/SC eventuality property. The execution environment is responsible for communicating which regions have this property.

LR和SC必须处于相同的memory空间，并且该空间能保证LR和SC能够最终完成，即该物理空间不能导致LR/SC sequence一直失败

限制5：

The SC must be to the same effective address and of the same data size as the latest LR executed by the same hart.

LR和SC必须具有相同的addr和size，否则非常容易出现系统的挂死

结论

因此，constrained LR/SC只是对软件的建议，软件需要在使用LR/SC时，遵守这些限制，从而避免LR/SC的无限循环。
但软件也并非不能使用违反了某些约束的unconstrained LR/SC loop，合适的软件，应该具有检测重复失败的SC的机制，一旦发现SC反复失败，应该跳到不依赖于LR/SC的程序中，从而结束无限的LR/SC循环。

引申

那么哪些情况应该清除reservation呢？一般来说，以下情况应该清除reservation

SC完成时，无论成功或失败，都应该清除reservation。
当建立的reservation被别的核改写过时，应该清除。在有cache的实现中，含有该reservation的cacheline被invalidate/evict时，无论是因为别的核snoop还是本核自己触发的，都应该清除reservation。
当响应trap（中断、同步异常、异步异常），或从trap程序中返回（mret/sret）时，应该清除reservation。甚至，如果有允许user-mode的线程相互抢占的场景，uret也需要清除reservation。
当进入debug mode（trigger, ebreak, halt, single step）或从debug返回(resume)时，应该清除reservation。

在某些情况，我们可能并不能确定某事件是否应该清除reservation，但如果该事件不会一直阻止LR/SC的最终完成，我们也可以投机地清除reservation，毕竟SC失败了之后，LR/SC loop会再次循环，重新做一遍。

附录

相对于CAS来说，LR/SC有以下的好处：

CAS 有 ABA 问题，即只关心结果，不关心过程，只要上一次 load 的值和 cas 获取的值一致，那么新值就成功写入。即使有人写过该地址，或者写过两次，第二次又改回来。这有时候不符合程序员的预期，破坏了原子性。而 LR/SC 会监视任何写入操作，即使写入相同值也会破坏 SC 指令。
CAS 硬件实现过于复杂，它需要三个源寄存器，和一个目的寄存器 (结果)。
为了解决 ABA 问题，有些系统提供了 DW-CAS 指令，这个指令的实现更加复杂，需要 5 个源寄存器和 2 个目的寄存器。
LR/SC 的效率优于 CAS，因为 CAS 总是要多一条 load 指令，即 load + CAS 指令，而 LR + SC 只有一条 load 指令。

基于以上原因，RISCV的制定者们选择了LR/SC指令。
BUT！
但事实上，软件 API 并不能很好地支持 LR/SC，譬如 Linux 只用 cmpxchg 原语直接对应 CAS 指令，而没有 load_reserved/store_conditional 原语，这导致实际使用是用 LR/SC 去实现 cmpxchg。

#a0 holds address of memory location 
#a1 holds expected value 
#a2 holds desired value 
#a0 holds return value, 0 if successful, !0 otherwise 
cas: 
lr.w t0, (a0) #Load original value. 
bne t0, a1, fail #Doesn’t match, so fail. 
sc.w t0, a2, (a0) # Try to update. 
bnez t0, cas # Retry if store-conditional failed. 
li a0, 0 # Set return to success. 
jr ra # Return. 

fail:
li a0, 1 # Set return to failure. 
jr ra # Return.

加上 cmpxchg 使用本身的循环结构，实际上构成了双循环实现：

c = v->counter; 
while ((old = cmpxchg(&v->counter, c, c c_op i)) != c)
       c = old;

如果这样使用，那么 RISC-V 的 4 条理由中，1) 3) 4) 的好处都不存在了，所以放弃 CAS 对软件兼容性产生负面影响。arm64 目前支持了 CAS 指令，就是很好的例证。