Learn And Life.

Focus on PHP and open source.



Redis Concurrency and Design

Posted on 2017-05-20

Perhaps my understanding of concurrency was never deep enough, and perhaps my grasp of certain concepts and designs was still lacking. That is how you end up leaving a bug in your code that does not feel like a bug, the kind that makes you scratch your head and wonder what on earth just happened. When thinking through a problem or working on a design, it pays to lean on proven ideas and to understand what kind of workload the product you use and the business logic you write actually represent. Concurrency and execution time are related, and you may not realize how much that relationship shapes the way your code behaves. Start with the code below; set the design aside for now and just read the code.

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$key    = 'users';
$page   = $argv[1];
$offset = 10;

// Fetch one page of the sorted set.
$result = $redis->zRange($key, ($page - 1) * $offset, $page * $offset - 1);

if (empty($result)) {
    // Cache miss: drop the key and rebuild the whole set from scratch.
    $redis->del($key);
    $redis->expire($key, 30);

    $i = 0;
    while ($i < 100) {
        $redis->zIncrBy($key, 1, $i);
        sleep(2);       // simulate a slow, expensive rebuild
        $i++;
    }
}

The code is simple: if the cache is empty, rebuild it. You may not see anything wrong with it yet, so let's walk through what happens. First run the script once, then check the key:

127.0.0.1:6379> zrange users 0 -1 withscores
1) "0"
2) "1"
3) "1"
4) "1"
5) "2"
6) "1"
7) "3"
8) "1"
9) "4"
10) "1"

Good, the data is there. Now run it once more:

127.0.0.1:6379> zrange users 0 -1 withscores
1) "0"
2) "1"
3) "1"
4) "1"
5) "2"
6) "1"
7) "3"
8) "1"
9) "4"
10) "1"

Still there, and still correct. You might quietly congratulate yourself: perfect! But is it really fine? Now start two of these processes at the same time and look at the result:

127.0.0.1:6379> zrange users 0 -1 withscores
1) "0"
2) "1"
3) "1"
4) "1"
5) "2"
6) "2"
7) "3"
8) "2"
9) "4"
10) "2"

Oops. How did that happen? You may well be thinking: no way. But that is exactly what happened: your program has a bug. If this code were running in production you would probably have no idea how it occurred; you would stare at the code with no logs to help, completely stuck. This is a concurrency problem, so what actually went wrong? One process deleted the key and then spent a long time rebuilding it; before the rebuild could finish, another request came in and ran the same code, so the scores were incremented twice. It is precisely this time-consuming operation that pushes the result away from what you designed. To be honest, this is simply poor design: separating the expensive rebuild from the read path is the better practice. So stop being lazy and separate reads from writes. It makes the API faster, it keeps you out of trouble once the data volume grows, and it avoids this kind of concurrency problem. Do the design work properly and understand what you are actually doing.
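One way to close this gap (a minimal sketch, not the original code of this post) is to let only one process rebuild, and to rebuild into a temporary key that is swapped in atomically with RENAME, so readers never see a half-built set. The users:rebuild and users:lock key names, the 300-second lock TTL and the 100-member loop are assumptions made for illustration:

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$key     = 'users';
$tempKey = 'users:rebuild';   // hypothetical temporary key
$lockKey = 'users:lock';      // hypothetical rebuild lock
$page    = $argv[1];
$offset  = 10;

$result = $redis->zRange($key, ($page - 1) * $offset, $page * $offset - 1);

if (empty($result)) {
    // Only the process that wins the lock rebuilds; everyone else just returns.
    if ($redis->set($lockKey, 1, ['nx', 'ex' => 300])) {
        $redis->del($tempKey);
        for ($i = 0; $i < 100; $i++) {
            $redis->zIncrBy($tempKey, 1, $i);   // the slow rebuild happens off to the side
        }
        $redis->rename($tempKey, $key);         // readers switch to the new set in one step
        $redis->expire($key, 30);
        $redis->del($lockKey);
    }
}

Better still, take the rebuild out of the request path altogether (a cron job or a queue worker), so the read API only ever reads.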


Images and CDN

Posted on 2017-05-08

A short note from work. When serving images you inevitably put a CDN in front of them, but with that many images, cache invalidation becomes a problem. The usual way to handle invalidation is to append a timestamp to the image URL whenever the image is updated, which raises two questions: who appends the timestamp, and where is it stored? Our approach is this: whenever an image is updated we record a timestamp for it and write it to a cache; when the image is later read, we check whether a timestamp exists for it, and append it to the URL if it does, otherwise the URL stays unchanged. This does give real-time invalidation, but it also requires a large amount of storage, especially when there are a lot of images.
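A rough sketch of that scheme in PHP is shown below; the Redis cache, the img_ts: key prefix and the helper names are assumptions for illustration, not the actual implementation:

function recordImageUpdate(Redis $cache, $path)
{
    // Called whenever an image is (re)uploaded: remember when it last changed.
    $cache->set('img_ts:' . $path, time());
}

function imageUrl(Redis $cache, $cdnBase, $path)
{
    $ts = $cache->get('img_ts:' . $path);
    // Only images that have been updated at least once get a cache-busting suffix.
    return $ts === false ? $cdnBase . $path : $cdnBase . $path . '?v=' . $ts;
}

The cost is exactly the trade-off described above: one cache entry per updated image.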


About Code Review

Posted on 2017-05-03

What makes code good code? Which details deserve attention during a code review?

1. Coding standards. Every company has its own coding standard, for example:

1) Naming conventions?
2) Newlines or blank lines?
3) Tabs or spaces?
4) Spaces around equals signs and parentheses?
5) Are the comments reasonable?
6) Is each function or class focused on a single responsibility, modular, and not too long?
7) Unit tests: do the test cases give adequate coverage?
8) Is git being used according to the team's conventions?
......

2. Code review checklist. Every team and every individual should maintain their own checklist and run through it before each commit, for example:

1) Which ways of writing code lead to poor performance?
2) Which interfaces or functions should be used with caution?
3) Which design approaches should be avoided?
4) Which habits tend to cause memory leaks?
.....

3. Impact assessment

1) Configuration changes? Are all environments compatible with them?
2) Changes to global variables or global functions?
3) Are there any data security or consistency issues?
4) Capacity assessment? Throughput assessment? Could the change cause blocking?
5) Walk through the business: does the change cover every related business flow, or only part of them?
......

4. Design review

1) Does the implementation match the intended design?
2) Is load testing needed?
3) Is it deployed independently or together with existing services? Is a capacity assessment needed?
4) Do any scheduled task scripts need to be deployed?
5) Does it need its own domain name?
.....

5. Performance assessment

1) Watch for-loops and batch processing: merge what can be merged, simplify what can be simplified, and don't just make do (see the short example after this list)
2) Watch the indexes: are they actually being used?
3) Understand the workload model: is it CPU-bound or IO-bound? Simplify the design and test it well
.....
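As a small illustration of the first point (a made-up example, not from any real code base): fetching values one by one inside a loop pays one network round trip per item, while a single batched call does the same work in one go. The user:*:name keys are hypothetical:

$userIds = [1, 2, 3];

// Slow: one round trip to Redis per id.
$names = [];
foreach ($userIds as $id) {
    $names[$id] = $redis->get('user:' . $id . ':name');
}

// Better: one batched round trip for all ids.
$keys  = array_map(function ($id) { return 'user:' . $id . ':name'; }, $userIds);
$names = array_combine($userIds, $redis->mGet($keys));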

6. Review levels

1) Compilable
2) Runnable
3) Testable
4) Readable
5) Maintainable
6) Reusable
7) Extensible

7. Review tools

1) Phabricator
2) Gerrit
3) GitLab

8. Requirements review

1) Analyse the requirement
2) Is the requirement reasonable?
3) Is it urgent?
4) Is it a big requirement or a small one?
5) Push back on and simplify requirements

Code review is about your own growth as much as about product quality, and every programmer should take it seriously. It kills bad habits early, reduces bugs, earns the trust of your colleagues, and gives you the chance to pick holes in other people's code too. Don't let yourself be worked like a beast of burden; stop and ask yourself whether you ended up as one precisely because you approach your work thinking like one.


Concurrent Programming: Promise, Future, and Callback

Posted on 2017-04-04

Futures and Promises

By: Philipp Haller, Aleksandar Prokopec, Heather Miller, Viktor Klang, Roland Kuhn, and Vojin Jovanovic

This SIP is part of two SIPs, which together constitute a redesign of scala.concurrent into a unified substrate for a variety of parallel frameworks. This proposal focuses on futures and promises.

Introduction
Futures provide a nice way to reason about performing many operations in parallel– in an efficient and non-blocking way. The idea is simple, a Future is a sort of placeholder object that you can create for a result that doesn’t yet exist. Generally, the result of the Future is computed concurrently and can be later collected. Composing concurrent tasks in this way tends to result in faster, asynchronous, non-blocking parallel code.

This is particularly evident due to the fact that within the Scala ecosystem alone, several frameworks aiming to provide a full-featured implementation of futures and promises have arisen, including the futures available in the Scala Actors package [4], Akka [3], Finagle [2], and Scalaz [5].

The redesign of scala.concurrent provides a new Futures and Promises API, meant to act as a common foundation for multiple parallel frameworks and libraries to utilize both within Scala’s standard library, and externally.

By default, futures and promises are non-blocking, making use of callbacks instead of typical blocking operations. In an effort to facilitate, and make use of callbacks on a higher-level, we provide combinators such as flatMap, foreach, and filter for composing futures in a non-blocking way. For cases where blocking is absolutely necessary, futures can be blocked on (although it is discouraged).

The futures and promises API builds upon the notion of an ExecutionContext, an execution environment designed to manage resources such as thread pools between parallel frameworks and libraries (detailed in an accompanying SIP, forthcoming). Futures and promises are created through such ExecutionContexts. For example, this makes it possible, in the case of an application which requires blocking futures, for an underlying execution environment to resize itself if necessary to guarantee progress.

Futures
A future is an abstraction which represents a value which may become available at some point. A Future object either holds a result of a computation or an exception in the case that the computation failed. An important property of a future is that it is in effect immutable– it can never be written to or failed by the holder of the Future object.

The simplest way to create a future object is to invoke the future method which starts an asynchronous computation and returns a future holding the result of that computation. The result becomes available once the future completes.

Here is an example. Let’s assume that we want to use the API of some popular social network to obtain a list of friends for a given user. After opening a new session we want to create an asynchronous request to the server for this list:

import scala.concurrent.Future
val session = socialNetwork.createSessionFor("user", credentials)
val f: Future[List[Friend]] = Future {
session.getFriends
}
The list of friends becomes available in the future f once the server responds.

An unsuccessful attempt may result in an exception. In the following example, the session value is incorrectly initialized, so the future will hold a NullPointerException instead of the value:

val session = null
val f: Future[List[Friend]] = Future {
session.getFriends
}
Callbacks
We are generally interested in the result value of the computation. To obtain the future’s result, a client of the future would have to block until the future is completed. Although this is allowed by the Future API as we will show later in this document, a better way to do it is in a completely non-blocking way, by registering a callback on the future. This callback is called asynchronously once the future is completed. If the future has already been completed when registering the callback, then the callback may either be executed asynchronously, or sequentially on the same thread.

The most general form of registering a callback is by using the onComplete method, which takes a callback function of type Either[Throwable, T] => U. The callback is applied to the value of type Right[T] if the future completes successfully, or to a value of type Left[Throwable] otherwise. The onComplete method is parametric in the return type of the callback, but it discards the result of the callback.

Coming back to our social network example, let’s assume we want to fetch a list of our own recent posts and render them to the screen. We do so by calling the method getRecentPosts which returns a List[String]:

val f: Future[List[String]] = Future {
session.getRecentPosts
}
f onComplete {
case Right(posts) => for (post <- posts) render(post)
case Left(t) => render("An error has occurred: " + t.getMessage)
}
The onComplete method is general in the sense that it allows the client to handle the result of both failed and successful future computations. To handle only successful results, the onSuccess callback is used (which takes a partial function):

val f: Future[List[String]] = Future {
session.getRecentPosts
}
f onSuccess {
case posts => for (post <- posts) render(post)
}
To handle failed results, the onFailure callback is used:

val f: Future[List[String]] = Future {
session.getRecentPosts
}
f onFailure {
case t => render("An error has occurred: " + t.getMessage)
}
f onSuccess {
case posts => for (post <- posts) render(post)
}
The onFailure callback is only executed if the future fails, that is, if it contains an exception. The onComplete, onSuccess, and onFailure methods have result type Unit, which means invocations of these methods cannot be chained. This is an intentional design decision which was made to avoid suggesting that chained invocations may imply an ordering on the execution of the registered callbacks (callbacks registered on the same future are unordered).

Since partial functions have the isDefinedAt method, the onFailure method only triggers the callback if it is defined for a particular Throwable. In the following example the registered callback is never triggered:

val f = Future {
2 / 0
}
f onFailure {
case npe: NullPointerException =>
println("I'd be amazed if this printed out.")
}
Having a regular function callback as an argument to onFailure would require including the default case in every failure callback, which is cumbersome– omitting the default case would lead to MatchErrors later.

Second, try-catch blocks also expect a PartialFunction value. That means that if there are generic partial function exception handlers present in the application then they will be compatible with the onFailure method.

In conclusion, the semantics of callbacks are as follows:

1. Registering an onComplete callback on the future ensures that the corresponding closure is invoked after the future is completed, eventually.

2. Registering an onSuccess or onFailure callback has the same semantics as onComplete, with the difference that the closure is only called if the future is completed successfully or fails, respectively.

3. Registering a callback on a future which is already completed will result in the callback being executed eventually (as implied by 1). Furthermore, the callback may even be executed synchronously on the same thread that registered the callback if this does not cancel progress of that thread.

4. In the event that multiple callbacks are registered on the future, the order in which they are executed is not defined. In fact, the callbacks may be executed concurrently with one another. However, a particular Future implementation may have a well-defined order.

5. In the event that some of the callbacks throw an exception, the other callbacks are executed regardless.

6. In the event that some of the callbacks never complete (e.g. the callback contains an infinite loop), the other callbacks may not be executed at all. In these cases, a potentially blocking callback must use the blocking construct (see below).

7. Once executed, the callbacks are removed from the future object, thus being eligible for GC.

Functional Composition and For-Comprehensions
The examples we have shown so far lend themselves naturally to the functional composition of futures. Assume we have an API for interfacing with a currency trading service. Suppose we want to buy US dollars, but only when it’s profitable. We first show how this could be done using callbacks:

val rateQuote = Future {
connection.getCurrentValue(USD)
}
rateQuote onSuccess { case quote =>
val purchase = Future {
if (isProfitable(quote)) connection.buy(amount, quote)
else throw new Exception("not profitable")
}

purchase onSuccess {
case _ => println("Purchased " + amount + " USD")
}
}
We start by creating a future which fetches the current exchange rate. After it is successfully obtained from the server, we create another future which makes a decision to buy only if it is profitable to do so, and then sends a request.

This works, but is inconvenient for two reasons. First, we have to use onSuccess, and we have to nest the second purchase future within it. Second, the purchase future is not in the scope of the rest of the code.

For these two reasons, futures provide combinators which allow a more straightforward composition. One of the basic combinators is map, which, given a future and a mapping function for the value of the future, produces a new future that is completed with the mapped value once the original future is successfully completed. Let’s rewrite the previous example using the map combinator:

val rateQuote = Future {
connection.getCurrentValue(USD)
}
val purchase = rateQuote map {
quote => if (isProfitable(quote)) connection.buy(amount, quote)
else throw new Exception("not profitable")
}
purchase onSuccess {
case _ => println("Purchased " + amount + " USD")
}
The semantics of map is as follows. If the original future is completed successfully then the returned future is completed with a mapped value from the original future. If the mapping function throws an exception the future is completed with that exception. If the original future fails with an exception then the returned future also contains the same exception. This exception propagating semantics is present in the rest of the combinators, as well.

To enable for-comprehensions, futures also have the flatMap, filter and foreach combinators. The flatMap method takes a function that maps the value to a new future g, and then returns a future which is completed once g is completed.

Let's assume that we want to exchange US dollars for Swiss francs (CHF). We have to fetch quotes for both currencies, and then decide on buying based on both quotes. Here is an example of flatMap usage within for-comprehensions:

val usdQuote = Future { connection.getCurrentValue(USD) }
val chfQuote = Future { connection.getCurrentValue(CHF) }
val purchase = for {
usd <- usdQuote
chf <- chfQuote
if isProfitable(usd, chf)
} yield connection.buy(amount, chf)
purchase onSuccess {
case _ => println("Purchased " + amount + " CHF")
}
The filter combinator creates a new future which contains the value of the original future only if it satisfies some predicate. Otherwise, the new future is failed with a NoSuchElementException.

It is important to note that calling the foreach combinator does not block. Instead, the function for the foreach gets asynchronously executed only if the future is completed successfully. This means that the foreach has exactly the same semantics as the onSuccess callback.

Since the Future trait can conceptually contain two types of values (computation results and exceptions), there exists a need for combinators which handle exceptions.

Let’s assume that based on the rateQuote we decide to buy a certain amount. The connection.buy method takes an amount to buy and the expected quote. It returns the amount bought. If the quote has changed in the meanwhile, it will throw a QuoteChangedException and it will not buy anything. If we want our future to contain 0 instead of the exception, we use the recover combinator:

val purchase: Future[Int] = rateQuote map {
quote => connection.buy(amount, quote)
} recover {
case quoteExc: QuoteChangedException => 0
}
The recover combinator creates a new future which holds the same result as the original future if it completed successfully. If it did not then the partial function argument is applied to the Throwable which failed the original future. If it maps the Throwable to some value, then the new future is successfully completed with that value.

The recoverWith combinator creates a new future which holds the same result as the original future if it completed successfully. Otherwise, the partial function is applied to the Throwable which failed the original future. If it maps the Throwable to some future, then this future is completed with the result of that future. Its relation to recover is similar to that of flatMap to map.

Combinator fallbackTo creates a new future which holds the result of this future if it was completed successfully, or otherwise the successful result of the argument future. In the event that both this future and the argument future fail, the new future is completed with the exception from this future, as in the following example which tries to print US dollar value, but prints the Swiss franc value in the case it fails to obtain the dollar value:

val usdQuote = Future {
connection.getCurrentValue(USD)
} map {
usd => "Value: " + usd + "$"
}
val chfQuote = Future {
connection.getCurrentValue(CHF)
} map {
chf => "Value: " + chf + "CHF"
}
val anyQuote = usdQuote fallbackTo chfQuote
anyQuote onSuccess { println(_) }
The either combinator creates a new future which either holds the result of this future or the argument future, whichever completes first, regardless of success or failure. Here is an example in which the quote which is returned first gets printed:

val usdQuote = Future {
connection.getCurrentValue(USD)
} map {
usd => "Value: " + usd + "$"
}
val chfQuote = Future {
connection.getCurrentValue(CHF)
} map {
chf => "Value: " + chf + "CHF"
}
val anyQuote = usdQuote either chfQuote
anyQuote onSuccess { println(_) }
The andThen combinator is used purely for side-effecting purposes. It returns a new future with exactly the same result as the current future, regardless of whether the current future failed or not. Once the current future is completed with the result, the closure corresponding to the andThen is invoked and then the new future is completed with the same result as this future. This ensures that multiple andThen calls are ordered, as in the following example which stores the recent posts from a social network to a mutable set and then renders all the posts to the screen:

val allposts = mutable.Set[String]()
Future {
session.getRecentPosts
} andThen {
case Success(posts) => allposts ++= posts
} andThen {
case _ =>
clearAll()
for (post <- allposts) render(post)
}
In summary, the combinators on futures are purely functional. Every combinator returns a new future which is related to the future it was derived from.

Projections
To enable for-comprehensions on a result returned as an exception, futures also have projections. If the original future fails, the failed projection returns a future containing a value of type Throwable. If the original future succeeds, the failed projection fails with a NoSuchElementException. The following is an example which prints the exception to the screen:

val f = Future {
2 / 0
}
for (exc <- f.failed) println(exc)
The following example does not print anything to the screen:

val f = Future {
4 / 2
}
for (exc <- f.failed) println(exc)
Extending Futures
Support for extending the Futures API with additional utility methods is planned. This will allow external frameworks to provide more specialized utilities.

Blocking
As mentioned earlier, blocking on a future is strongly discouraged – for the sake of performance and for the prevention of deadlocks – in favour of using callbacks and combinators on futures. However, blocking may be necessary in certain situations and is supported by the Futures API.

In the currency trading example above, one place to block is at the end of the application to make sure that all of the futures have been completed. Here is an example of how to block on the result of a future:

import scala.concurrent._
def main(args: Array[String]) {
val rateQuote = Future {
connection.getCurrentValue(USD)
}

val purchase = rateQuote map {
quote => if (isProfitable(quote)) connection.buy(amount, quote)
else throw new Exception("not profitable")
}

blocking(purchase, 0 ns)
}
In the case that the future fails, the caller is forwarded the exception that the future is failed with. This includes the failed projection– blocking on it results in a NoSuchElementException being thrown if the original future is completed successfully.

The Future trait implements the Awaitable trait with a single method await(). The await() method contains code which can potentially result in a long running computation, block on some external condition or which may not complete the computation at all. The await() method cannot be called directly by the clients, it can only be called by the execution context implementation itself. To block on the future to obtain its result, the blocking method must be used.

val f = Future { 1 }
val one: Int = blocking(f, 0 ns)
To allow clients to call 3rd party code which is potentially blocking and avoid implementing the Awaitable trait, the same blocking primitive can also be used in the following form:

blocking {
potentiallyBlockingCall()
}
The blocking code may also throw an exception. In this case, the exception is forwarded to the caller.

Exceptions
When asynchronous computations throw unhandled exceptions, futures associated with those computations fail. Failed futures store an instance of Throwable instead of the result value. Futures provide the onFailure callback method, which accepts a PartialFunction to be applied to a Throwable. The following special exceptions are treated differently:

TimeoutException - stored when the computation is not completed before some timeout (typically managed by an external scheduler).
scala.runtime.NonLocalReturnControl[_] - this exception holds a value associated with the return. Typically, return constructs in method bodies are translated to throws with this exception. Instead of keeping this exception, the associated value is stored into the future or a promise.

ExecutionException - stored when the computation fails due to an unhandled InterruptedException, Error or a scala.util.control.ControlThrowable. In this case the ExecutionException has the unhandled exception as its cause. These exceptions are rethrown in the thread executing the failed asynchronous computation. The rationale behind this is to prevent propagation of critical and control-flow related exceptions normally not handled by the client code and at the same time inform the client in which future the computation failed.

Promises
While futures are defined as a type of read-only placeholder object created for a result which doesn’t yet exist, a promise can be thought of as a writeable, single-assignment container, which completes a future. That is, a promise can be used to successfully complete a future with a value (by “completing” the promise) using the success method. Conversely, a promise can also be used to complete a future with an exception, by failing the promise, using the failure method.

A promise p completes the future returned by p.future. This future is specific to the promise p. Depending on the implementation, it may be the case that p.future == p.

Consider the following producer-consumer example:

import scala.concurrent.{ Future, Promise }
val p = Promise[T]()
val f = p.future
val producer = Future {
val r = produceSomething()
p success r
continueDoingSomethingUnrelated()
}
val consumer = Future {
startDoingSomething()
f onSuccess {
case r => doSomethingWithResult()
}
}
Here, we create a promise and use its future method to obtain the Future that it completes. Then, we begin two asynchronous computations. The first does some computation, resulting in a value r, which is then used to complete the future f, by fulfilling p. The second does some computation, and then reads the result r of the completed future f. Note that the consumer can obtain the result before the producer task is finished executing the continueDoingSomethingUnrelated() method.

As mentioned before, promises have single-assignment semantics. As such, they can be completed only once. Calling success on a promise that has already been completed (or failed) will throw an IllegalStateException.

The following example shows how to fail a promise.

val p = Promise[T]()
val f = p.future
val producer = Future {
val r = someComputation
if (isInvalid(r))
p failure (new IllegalStateException)
else {
val q = doSomeMoreComputation(r)
p success q
}
}
Here, the producer computes an intermediate result r, and checks whether it’s valid. In the case that it’s invalid, it fails the promise by completing the promise p with an exception. In this case, the associated future f is failed. Otherwise, the producer continues its computation, and finally completes the future f with a valid result, by completing promise p.

Promises can also be completed with a complete method which takes either a failed result of type Left[Throwable] or a successful result of type Right[T].

Analogous to success, calling failure and complete on a promise that has already been completed will throw an IllegalStateException.

One nice property of programs written using promises with operations described so far and futures which are composed through monadic operations without side-effects is that these programs are deterministic. Deterministic here means that, given that no exception is thrown in the program, the result of the program (values observed in the futures) will always be the same, regardless of the execution schedule of the parallel program.

In some cases the client may want to complete the promise only if it has not been completed yet (e.g., there are several HTTP requests being executed from several different futures and the client is interested only in the first HTTP response - corresponding to the first future to complete the promise). For these reasons the methods tryComplete, trySuccess and tryFailure exist on the promise. The client should be aware that using these methods results in programs which are not deterministic, but depend on the execution schedule.

The method completeWith completes the promise with another future. After the future is completed, the promise gets completed with the result of that future as well. The following program prints 1:

val f = Future { 1 }
val p = Promise[Int]()
p completeWith f
p.future onSuccess {
case x => println(x)
}
When failing a promise with an exception, three subtypes of Throwables are handled specially. If the Throwable used to break the promise is a scala.runtime.NonLocalReturnControl, then the promise is completed with the corresponding value. If the Throwable used to break the promise is an instance of Error, InterruptedException, or scala.util.control.ControlThrowable, the Throwable is wrapped as the cause of a new ExecutionException which, in turn, is failing the promise.

Utilities
To simplify handling of time in concurrent applications scala.concurrent will introduce a Duration abstraction. Duration is not supposed to be yet another general time abstraction. It is meant to be used with concurrency libraries and will reside in the scala.concurrent.util package.

Duration is the base class representing a length of time. It can be either finite or infinite. A finite duration is represented by the FiniteDuration class, which is constructed from a Long length and a java.util.concurrent.TimeUnit. Infinite durations, also extending Duration, exist in only two instances, Duration.Inf and Duration.MinusInf. The library also provides several Duration subclasses for implicit conversion purposes; those should not be used directly.

The abstract Duration class contains methods that allow:

Conversion to different time units (toNanos, toMicros, toMillis, toSeconds, toMinutes, toHours, toDays and toUnit(unit: TimeUnit)).
Comparison of durations (<, <=, > and >=).
Arithmetic operations (+, -, *, / and unary-).
Minimum and maximum between this duration and the one supplied in the argument (min, max).
Check if the duration is finite (finite_?).
Duration can be instantiated in the following ways:

Implicitly from types Int and Long. For example val d = 100 millis.
By passing a Long length and a java.util.concurrent.TimeUnit. For example val d = Duration(100, MILLISECONDS).
By parsing a string that represents a time period. For example val d = Duration("1.2 µs").
Duration also provides unapply methods so it can be used in pattern matching constructs. Examples:

import scala.concurrent.util.Duration
import scala.concurrent.util.duration._
import java.util.concurrent.TimeUnit._

// instantiation
val d1 = Duration(100, MILLISECONDS) // from Long and TimeUnit
val d2 = Duration(100, "millis") // from Long and String
val d3 = 100 millis // implicitly from Long, Int or Double
val d4 = Duration("1.2 µs") // from String
// pattern matching
val Duration(length, unit) = 5 millis
References
[1] The Task-Based Asynchronous Pattern, Stephen Toub, Microsoft, April 2011
[2] Finagle Documentation
[3] Akka Documentation: Futures
[4] Scala Actors Futures
[5] Scalaz Futures


Lock-Free Programming in C++

Posted on 2017-04-04

At CppCon 2014, Herb Sutter gave a talk about lock-free programming in C++ (part 1, part 2) where he provided the fundamental concepts of lock-free programming, and presented three algorithms to show lock-free techniques. Here is a summary of the most relevant points in the talk.

The first point that Herb addresses is what benefits lock-free code can bring:

Improving concurrency and scalability by reducing blocking/waiting.
Getting rid of potential issues such as race conditions, deadlocks, scarce composability.
Lock-free programming is no panacea, though, since a lock-free algorithm is usually more complex to design, and it has other potential issues, such as contention, which can badly affect performance. This leads to Herb's first strong suggestion:

Only apply lock-free technique once you have measured your program and found it has some performance or scalability problem.
After implementing the lock-free algorithm, measure again to check that things improved effectively.
Fundamentals

There are a few basic principles to keep in mind when thinking about lock-free algorithms:

Consistency: any significant piece of code should be thought of as a transaction bringing the system from one consistent state to another.
Isolation: two transactions never operate on the same data.
Durability: a committed transaction is never overwritten by another before the latter has “seen” the results of the former.
The fundamental technique to achieve that is:

computing all changes in some private area;
using one single atomic write through a special compare/exchange function to make them public.
In C++11, this translates into using atomic<T>, which gives two big advantages:

You can think of each read or write operation as atomic, with no lock required.
Furthermore, reads and writes are guaranteed not to be reordered.
A few important points about atomic<T> are the following:

For small types, atomic<T> is usually implemented through platform-specific atomic operations.
atomic<T> objects must always be initialized (otherwise they will contain garbage).
Two atomic operations are guaranteed to be individually atomic, but the status of the object can change in between.
Example: Double-checked Locking

The first example Herb presents is ensuring a global object is constructed only once.

The key idea is to protect the atomic write operation with a lock, while letting the atomic read happen without locking. Blocking can thus only happen among those threads that compete to initialize the singleton. The reason behind the algorithm name is that the instance pointer is checked twice, before and after getting the lock on it:

atomic<Widget*> Widget::pInstance{ nullptr };

Widget* Widget::Instance() {
    if (pInstance == nullptr) {              // first check, without the lock
        lock_guard<mutex> lock { mutW };
        if (pInstance == nullptr) {          // second check, under the lock
            pInstance = new Widget();
        }
    }
    return pInstance;
}
Example: Producer-consumers

The second example described by Herb is a classical producer-consumers algorithm. He begins by describing a traditional solution using locks where:

The producer gets a lock on the shared queue and pushes a few objects onto it; for each object, a condition variable is notified so that consumers know.
On the other hand, consumers just try to get a lock on the queue and when one gets it, it checks if there is an object to consume; if there is, then it is popped and processed after releasing the mutex.
A first variant of this algorithm is possible using a lock-free technique, whereby the slist is accessed through an atomic variable. What this allows is for the producer to create all of its items at once and then publish them to the consumers by atomically setting the queue's head. Consumers remain the same.

Next, a full lock-free implementation is considered. In this case, the algorithm is based on the idea that the producer has to fill a certain number of slots. When it has a new task, it looks for an empty slot and stores the task there. In the following code, slot is an atomic variable:

curr = 0;
// Phase 1: build and distribute tasks
while (ThereAreMoreTasks()) {
    task = AllocateAndBuildNewTask();
    while (slot[curr] != null)
        curr = (curr+1)%K;
    slot[curr] = task;
    sem[curr].signal();
}
// Phase 2: Stuff the mailbox with done
numNotified = 0;
while (numNotified < K) {
    while (slot[curr] != null)
        curr = (curr+1)%K;
    slot[curr] = done;
    sem[curr].signal();
    ++numNotified;
}
As to consumers, the code is simpler:

myTask = null;
while (myTask != done) {
    while ((myTask = slot[mySlot]) == null)
        sem[mySlot].wait();
    if (myTask != done) {
        slot[mySlot] = null;
        DoWork(myTask);
    }
}
A consumer waits on a semaphore until some task is in slot. When a task comes in, the consumer gets it and empties the slot. After doing that, the consumer starts processing its data. This responds to the same idea as before of doing work outside of the critical section. But if the consumer is slower than the producer, then it could make sense to do the work before releasing the lock, so that the producer would not fill that same slot again while the consumer is busy, but would preferably fill another empty slot. This shows how you can make different decisions subtly affecting throughput/scalability vs. load balancing.

Example: Singly-linked list

A singly-linked list is one of the simplest possible data structures, supporting in this case just four operations: construct, destroy, find, push_front.

The lock-free implementation proposed by Herb uses an atomic<Node*> head{ nullptr }; to access the slist head. The only method actually presenting concurrency issues is push_front, which in a single-threaded version could look like this:

template<typename T>
void slist<T>::push_front(T t) {
    auto p = new Node;
    p->t = t;
    p->next = head;
    head = p;
}
This code has problems since it opens up the possibility of races when setting the new head value. We can fix this problem by using compare_exchange when writing to head, as shown below:

template<typename T>
void slist<T>::push_front(T t) {
    auto p = new Node;
    p->t = t;
    p->next = head;

    while (!head.compare_exchange_weak(p->next, p))
        {}
}
Here we try to swap head and p until we succeed. This use of compare_exchange_weak is typical in lock-free code. It is used mostly inside loops, while outside of loops compare_exchange_strong is used.

More issues arise when trying to implement a pop operation, which will erase the first element from the list. In this case, one major cause of complexity is the possibility of returning a pointer to an object that could be deleted by another thread shortly thereafter. This problem is well-known in literature as the ABA problem, and Herb goes into detail describing how it can arise in the given scenario.

C++11 allows an elegant solution to this problem, by not using raw pointers and replacing them with a shared_ptr. In pseudo-code, the implementation becomes:

template<typename T>
class slist {
    struct Node { T t; shared_ptr<Node> next; };
    atomic<shared_ptr<Node>> head;

public:
    slist() = default;
    ~slist() = default;

    class reference {
        shared_ptr<Node> p;
    public:
        reference(shared_ptr<Node> p_) : p{p_} {}
        T& operator*()  { return p->t; }
        T* operator->() { return &p->t; }
    };

    auto find(T t) const {
        auto p = head.load();
        while (p && p->t != t)
            p = p->next;
        return reference{move(p)};
    }

    void push_front(T t) {
        auto p = make_shared<Node>();
        p->t = t;
        p->next = head;
        while (!head.compare_exchange_weak(p->next, p))
            {}
    }

    void pop_front() {
        auto p = head.load();
        while (p && !head.compare_exchange_weak(p, p->next))
            {}
    }
};
Here the trick is that the pointer is returned wrapped into a shared_ptr, so we are safe as to deletion concerns.

This implementation has a nice property known as linearizability, which means that a series of overlapping operations can be thought of as if they had been executed in sequence.

The final part of the talk is dedicated to discussing an example of measuring a program to find out how it behaves and what benefits it can get from going lock-free.


Lockless Programming

Posted on 2017-04-04

Lockless programming is a way to safely share changing data between multiple threads without the cost of acquiring and releasing locks. This sounds like a panacea, but lockless programming is complex and subtle, and sometimes doesn’t give the benefits that it promises. Lockless programming is particularly complex on Xbox 360.
Lockless programming is a valid technique for multithreaded programming, but it should not be used lightly. Before using it you must understand the complexities, and you should measure carefully to make sure that it is actually giving you the gains that you expect. In many cases, there are simpler and faster solutions, such as sharing data less frequently, which should be used instead.
Using lockless programming correctly and safely requires significant knowledge of both your hardware and your compiler. This article gives an overview of some of the issues to consider when trying to use lockless programming techniques.
Programming with Locks

When writing multi-threaded code it is often necessary to share data between threads. If multiple threads are simultaneously reading and writing the shared data structures, memory corruption can occur. The simplest way of solving this problem is to use locks. For instance, if ManipulateSharedData should only be executed by one thread at a time, a CRITICAL_SECTION can be used to guarantee this, as in the following code:

// Initialize
CRITICAL_SECTION cs;
InitializeCriticalSection(&cs);

// Use
void ManipulateSharedData()
{
EnterCriticalSection(&cs);
// Manipulate stuff…
LeaveCriticalSection(&cs);
}

// Destroy
DeleteCriticalSection(&cs);

This code is fairly simple and straightforward, and it is easy to tell that it is correct. However, programming with locks comes with several potential disadvantages. For example, if two threads try to acquire the same two locks but acquire them in a different order, you may get a deadlock. If a program holds a lock for too long—because of poor design or because the thread has been swapped out by a higher priority thread—other threads may be blocked for a long time. This risk is particularly great on Xbox 360 because the software threads are assigned a hardware thread by the developer, and the operating system won’t move them to another hardware thread, even if one is idle. The Xbox 360 also has no protection against priority inversion, where a high-priority thread spins in a loop while waiting for a low-priority thread to release a lock. Finally, if a deferred procedure call or interrupt service routine tries to acquire a lock, you may get a deadlock.
Despite these problems, synchronization primitives, such as critical sections, are generally the best way of coordinating multiple threads. If the synchronization primitives are too slow, the best solution is usually to use them less frequently. However, for those who can afford the extra complexity, another option is lockless programming.
Lockless Programming

Lockless programming, as the name suggests, is a family of techniques for safely manipulating shared data without using locks. There are lockless algorithms available for passing messages, sharing lists and queues of data, and other tasks.
When doing lockless programming, there are two challenges that you must deal with: non-atomic operations and reordering.
Non-Atomic Operations

An atomic operation is one that is indivisible—one where other threads are guaranteed to never see the operation when it is half done. Atomic operations are important for lockless programming, because without them, other threads might see half-written values, or otherwise inconsistent state.
On all modern processors, you can assume that reads and writes of naturally aligned native types are atomic. As long as the memory bus is at least as wide as the type being read or written, the CPU reads and writes these types in a single bus transaction, making it impossible for other threads to see them in a half-completed state. On x86 and x64, there is no guarantee that reads and writes larger than eight bytes are atomic. This means that 16-byte reads and writes of streaming SIMD extension (SSE) registers, and string operations, might not be atomic.
Reads and writes of types that are not naturally aligned—for instance, writing DWORDs that cross four-byte boundaries—are not guaranteed to be atomic. The CPU may have to do these reads and writes as multiple bus transactions, which could allow another thread to modify or see the data in the middle of the read or write.
Composite operations, such as the read-modify-write sequence that occurs when you increment a shared variable, are not atomic. On Xbox 360, these operations are implemented as multiple instructions (lwz, addi, and stw), and the thread could be swapped out partway through the sequence. On x86 and x64, there is a single instruction (inc) that can be used to increment a variable in memory. If you use this instruction, incrementing a variable is atomic on single-processor systems, but it is still not atomic on multi-processor systems. Making inc atomic on x86- and x64-based multi-processor systems requires using the lock prefix, which prevents another processor from doing its own read-modify-write sequence between the read and the write of the inc instruction.
The following code shows some examples:

// This write is not atomic because it is not natively aligned.
DWORD* pData = (DWORD*)(pChar + 1);
*pData = 0;

// This is not atomic because it is three separate operations.
++g_globalCounter;

// This write is atomic.
g_alignedGlobal = 0;

// This read is atomic.
DWORD local = g_alignedGlobal;

Guaranteeing Atomicity

You can be sure you are using atomic operations by a combination of the following:
Naturally atomic operations
Locks to wrap composite operations
Operating system functions that implement atomic versions of popular composite operations
Incrementing a variable is not an atomic operation, and incrementing may lead to data corruption if executed on multiple threads.

// This will be atomic.
g_globalCounter = 0;

// This is not atomic and gives undefined behavior
// if executed on multiple threads
++g_globalCounter;

Win32 comes with a family of functions that offer atomic read-modify-write versions of several common operations. These are the InterlockedXxx family of functions. If all modifications of the shared variable use these functions, the modifications will be thread safe.

// Incrementing our variable in a safe lockless way.
InterlockedIncrement(&g_globalCounter);

Reordering

A more subtle problem is reordering. Reads and writes do not always happen in the order that you have written them in your code, and this can lead to very confusing problems. In many multi-threaded algorithms, a thread writes some data and then writes to a flag that tells other threads that the data is ready. This is known as a write-release. If the writes are reordered, other threads may see that the flag is set before they can see the written data.
Similarly, in many cases, a thread reads from a flag and then reads some shared data if the flag says that the thread has acquired access to the shared data. This is known as a read-acquire. If reads are reordered, then the data may be read from shared storage before the flag, and the values seen might not be up to date.
Reordering of reads and writes can be done both by the compiler and by the processor. Compilers and processors have done this reordering for years, but on single-processor machines it was less of an issue. This is because CPU rearrangement of reads and writes is invisible on single-processor machines (for code that is not part of a device driver), and compiler rearrangement of reads and writes is less likely to cause problems on single-processor machines.
If the compiler or the CPU rearranges the writes shown in the following code, another thread may see that the alive flag is set while still seeing the old values for x or y. Similar rearrangement can happen when reading.
In this code, one thread adds a new entry to the sprite array:

// Create a new sprite by writing its position into an empty
// entry and then setting the 'alive' flag. If 'alive' is
// written before x or y then errors may occur.
g_sprites[nextSprite].x = x;
g_sprites[nextSprite].y = y;
g_sprites[nextSprite].alive = true;

In this next code block, another thread reads from the sprite array:

// Draw all sprites. If the reads of x and y are moved ahead of
// the read of 'alive' then errors may occur.
for( int i = 0; i < numSprites; ++i )
{
    if( g_sprites[i].alive )
    {
        DrawSprite( g_sprites[i].x,
                    g_sprites[i].y );
    }
}

To make this sprite system safe, we need to prevent both compiler and CPU reordering of reads and writes.
Understanding CPU Rearrangement of Writes
Some CPUs rearrange writes so that they are externally visible to other processors or devices in non-program order. This rearranging is never visible to single-threaded non-driver code, but it can cause problems in multi-threaded code.
Xbox 360
While the Xbox 360 CPU does not reorder instructions, it does rearrange write operations, which complete after the instructions themselves. This rearranging of writes is specifically allowed by the PowerPC memory model.
Writes on Xbox 360 do not go directly to the L2 cache. Instead, in order to improve L2 cache write bandwidth, they go through store queues and then to store-gather buffers. The store-gather buffers allow 64-byte blocks to be written to the L2 cache in one operation. There are eight store-gather buffers, which allow efficient writing to several different areas of memory.
The store-gather buffers are normally written to the L2 cache in first-in-first-out (FIFO) order. However, if the target cache-line of a write is not in the L2 cache, that write may be delayed while the cache-line is fetched from memory.
Even when the store-gather buffers are written to the L2 cache in strict FIFO order, this does not guarantee that individual writes are written to the L2 cache in order. For instance, imagine that the CPU writes to location 0x1000, then to location 0x2000, and then to location 0x1004. The first write allocates a store-gather buffer and puts it at the front of the queue. The second write allocates another store-gather buffer and puts it next in the queue. The third write adds its data to the first store-gather buffer, which remains at the front of the queue. Thus, the third write ends up going to the L2 cache before the second write.
Reordering caused by store-gather buffers is fundamentally unpredictable, especially because both threads on a core share the store-gather buffers, making the allocation and emptying of the store-gather buffers highly variable.
This is one example of how writes can be reordered. There may be other possibilities.
x86 and x64
Even though x86 and x64 CPUs do reorder instructions, they generally do not reorder write operations relative to other writes. There are some exceptions for write-combined memory. Additionally, string operations (MOVS and STOS) and 16-byte SSE writes can be internally reordered, but otherwise, writes are not reordered relative to each other.
Understanding CPU Rearrangement of Reads
Some CPUs rearrange reads so that they effectively come from shared storage in non-program order. This rearranging is never visible to single-threaded non-driver code, but can cause problems in multi-threaded code.
Xbox 360
Cache misses can cause some reads to be delayed, which effectively causes reads to come from shared memory out of order, and the timing of these cache misses is fundamentally unpredictable. Prefetching and branch prediction can also cause data to come from shared memory out of order. These are just a few examples of how reads can be reordered. There may be other possibilities. This rearranging of reads is specifically allowed by the PowerPC memory model.
x86 and x64
Even though x86 and x64 CPUs do reorder instructions, they generally do not reorder read operations relative to other reads. String operations (MOVS and STOS) and 16-byte SSE reads can be internally reordered, but otherwise, reads are not reordered relative to each other.
Other Reordering
Even though x86 and x64 CPUs do not reorder writes relative to other writes, or reorder reads relative to other reads, they can reorder reads relative to writes. Specifically, if a program writes to one location followed by reading from a different location, the read data may come from shared memory before the written data makes it there. This reordering can break some algorithms, such as Dekker’s mutual exclusion algorithms. In Dekker’s algorithm, each thread sets a flag to indicate that it wants to enter the critical region, and then checks the other thread’s flag to see if the other thread is in the critical region or trying to enter it. The initial code follows.

volatile bool f0 = false;
volatile bool f1 = false;

void P0Acquire()
{
// Indicate intention to enter critical region
f0 = true;
// Check for other thread in or entering critical region
while (f1)
{
// Handle contention.
}
// critical region
…
}

void P1Acquire()
{
// Indicate intention to enter critical region
f1 = true;
// Check for other thread in or entering critical region
while (f0)
{
// Handle contention.
}
// critical region
…
}

The problem is that the read of f1 in P0Acquire can read from shared storage before the write to f0 makes it to shared storage. Meanwhile, the read of f0 in P1Acquire can read from shared storage before the write to f1 makes it to shared storage. The net effect is that both threads set their flags to TRUE, and both threads see the other thread’s flag as being FALSE, so they both enter the critical region. Therefore, while problems with reordering on x86- and x64-based systems are less common than on Xbox 360, they definitely can still happen. Dekker’s algorithm will not work without hardware memory barriers on any of these platforms.
x86 and x64 CPUs will not reorder a write ahead of a previous read. x86 and x64 CPUs only reorder reads ahead of previous writes if they target different locations.
PowerPC CPUs can reorder reads ahead of writes, and can reorder writes ahead of reads, as long as they are to different addresses.
Reordering Summary
The Xbox 360 CPU reorders memory operations much more aggressively than do x86 and x64 CPUs, as shown in the following table. For more details, consult the processor documentation.
Reordering activity              x86 and x64    Xbox 360
Reads moving ahead of reads      No             Yes
Writes moving ahead of writes    No             Yes
Writes moving ahead of reads     No             Yes
Reads moving ahead of writes     Yes            Yes

Read-Acquire and Write-Release Barriers

The main constructs used to prevent reordering of reads and writes are called read-acquire and write-release barriers. A read-acquire is a read of a flag or other variable to gain ownership of a resource, coupled with a barrier against reordering. Similarly, a write-release is a write of a flag or other variable to give away ownership of a resource, coupled with a barrier against reordering.
The formal definitions, courtesy of Herb Sutter, are:
A read-acquire executes before all reads and writes by the same thread that follow it in program order.
A write-release executes after all reads and writes by the same thread that precede it in program order.
When your code acquires ownership of some memory, either by acquiring a lock or by pulling an item off of a shared linked list (without a lock), there is always a read involved—testing a flag or pointer to see if ownership of the memory has been acquired. This read may be part of an InterlockedXxx operation, in which case it involves both a read and a write, but it is the read that indicates whether ownership has been gained. After ownership of the memory is acquired, values are typically read from or written to that memory, and it is very important that these reads and writes execute after acquiring ownership. A read-acquire barrier guarantees this.
When ownership of some memory is released, either by releasing a lock or by pushing an item on to a shared linked list, there is always a write involved which notifies other threads that the memory is now available to them. While your code had ownership of the memory, it probably read from or wrote to it, and it is very important that these reads and writes execute before releasing ownership. A write-release barrier guarantees this.
It is simplest to think of read-acquire and write-release barriers as single operations. However, they sometimes have to be constructed from two parts: a read or write and a barrier that does not allow reads or writes to move across it. In this case, the placement of the barrier is critical. For a read-acquire barrier, the read of the flag comes first, then the barrier, and then the reads and writes of the shared data. For a write-release barrier the reads and writes of the shared data come first, then the barrier, and then the write of the flag.

// Read that acquires the data.
if( g_flag )
{
// Guarantee that the read of the flag executes before
// all reads and writes that follow in program order.
BarrierOfSomeSort();

// Now we can read and write the shared data.
int localVariable = sharedData.y;
sharedData.x = 0;

// Guarantee that the write to the flag executes after all
// reads and writes that precede it in program order.
BarrierOfSomeSort();

// Write that releases the data.
g_flag = false;

}

The only difference between a read-acquire and a write-release is the location of the memory barrier. A read-acquire has the barrier after the lock operation, and a write-release has the barrier before. In both cases the barrier is in-between the references to the locked memory and the references to the lock.
To understand why barriers are needed both when acquiring and when releasing data, it is best (and most accurate) to think of these barriers as guaranteeing synchronization with shared memory, not with other processors. If one processor uses a write-release to release a data structure to shared memory, and another processor uses a read-acquire to gain access to that data structure from shared memory, the code will then work properly. If either processor doesn’t use the appropriate barrier, the data sharing may fail.
Using the right barrier to prevent compiler and CPU reordering for your platform is critical.
One of the advantages of using the synchronization primitives provided by the operating system is that all of them include the appropriate memory barriers.
Preventing Compiler Reordering

A compiler’s job is to aggressively optimize your code in order to improve performance. This includes rearranging instructions wherever it is helpful and wherever it will not change behavior. Because the C++ Standard never mentions multithreading, and because the compiler doesn’t know what code needs to be thread-safe, the compiler assumes that your code is single-threaded when deciding what rearrangements it can safely do. Therefore, you need to tell the compiler when it is not allowed to reorder reads and writes.
With Visual C++ you can prevent compiler reordering by using the compiler intrinsic _ReadWriteBarrier. Where you insert _ReadWriteBarrier into your code, the compiler will not move reads and writes across it.

#if _MSC_VER < 1400
// With VC++ 2003 you need to declare _ReadWriteBarrier
extern "C" void _ReadWriteBarrier();

#else
// With VC++ 2005 you can get the declaration from intrin.h

#include <intrin.h>

#endif
// Tell the compiler that this is an intrinsic, not a function.

#pragma intrinsic(_ReadWriteBarrier)

// Create a new sprite by filling in a previously empty entry.
g_sprites[nextSprite].x = x;
g_sprites[nextSprite].y = y;
// Write-release, barrier followed by write.
// Guarantee that the compiler leaves the write to the flag
// after all reads and writes that precede it in program order.
_ReadWriteBarrier();
g_sprites[nextSprite].alive = true;

In the following code, another thread reads from the sprite array:

// Draw all sprites.
for( int i = 0; i < numSprites; ++i )
{
    // Read-acquire, read followed by barrier.
    if( g_sprites[i].alive )
    {
        // Guarantee that the compiler leaves the read of the flag
        // before all reads and writes that follow in program order.
        _ReadWriteBarrier();
        DrawSprite( g_sprites[i].x,
                    g_sprites[i].y );
    }
}

It is important to understand that _ReadWriteBarrier does not insert any additional instructions, and it does not prevent the CPU from rearranging reads and writes—it only prevents the compiler from rearranging them. Thus, _ReadWriteBarrier is sufficient when you implement a write-release barrier on x86 and x64 (because x86 and x64 do not reorder writes, and a normal write is sufficient for releasing a lock), but in most other cases, it is also necessary to prevent the CPU from reordering reads and writes.
You can also use _ReadWriteBarrier when you write to non-cacheable write-combined memory to prevent reordering of writes. In this case _ReadWriteBarrier helps to improve performance, by guaranteeing that the writes happen in the processor’s preferred linear order.
It is also possible to use the _ReadBarrier and _WriteBarrier intrinsics for more precise control of compiler reordering. The compiler will not move reads across a _ReadBarrier, and it will not move writes across a _WriteBarrier.
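For instance, on x86 and x64, where the CPU already keeps reads ordered with reads and writes ordered with writes, a compiler-only barrier is enough for this publish/consume pattern, and the narrower intrinsics make the intent explicit. A sketch, reusing the hypothetical sprite example from above (note that _WriteBarrier only orders writes and _ReadBarrier only orders reads, which is enough here because only writes precede the flag write and only reads follow the flag read):

// Writer: publish a sprite (write-release, compiler barrier only -- x86/x64 sketch).
g_sprites[nextSprite].x = x;
g_sprites[nextSprite].y = y;
_WriteBarrier();                    // the writes above may not be moved below this point
g_sprites[nextSprite].alive = true;

// Reader: consume a sprite (read-acquire, compiler barrier only -- x86/x64 sketch).
if( g_sprites[i].alive )
{
    _ReadBarrier();                 // the reads below may not be moved above this point
    DrawSprite( g_sprites[i].x, g_sprites[i].y );
}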
Preventing CPU Reordering

CPU reordering is more subtle than compiler reordering. You can never see it happen directly; you just see inexplicable bugs. On some processors, preventing CPU reordering of reads and writes requires memory barrier instructions. The all-purpose name for a memory barrier instruction, on both Xbox 360 and Windows, is MemoryBarrier. This macro is implemented appropriately for each platform.
On Xbox 360, MemoryBarrier is defined as lwsync (lightweight sync), also available through the lwsync intrinsic, which is defined in ppcintrinsics.h. lwsync also serves as a compiler memory barrier, preventing rearranging of reads and writes by the compiler.
The lwsync instruction is a memory barrier on Xbox 360 that synchronizes one processor core with the L2 cache. It guarantees that all writes before lwsync make it to the L2 cache before any writes that follow. It also guarantees that any reads that follow lwsync don’t get older data from L2 than previous reads. The one type of reordering that it does not prevent is a read moving ahead of a write to a different address. Thus, lwsync enforces memory ordering that matches the default memory ordering on x86 and x64 processors. To get full memory ordering requires the more expensive sync instruction (also known as heavyweight sync), but in most cases, this is not required. The memory reordering options on Xbox 360 are shown in the following table.
Xbox 360 reordering               No sync    lwsync    sync
Reads moving ahead of reads       Yes        No        No
Writes moving ahead of writes     Yes        No        No
Writes moving ahead of reads      Yes        No        No
Reads moving ahead of writes      Yes        Yes       No

PowerPC also has the synchronization instructions isync and eieio (which is used to control reordering to caching-inhibited memory). These synchronization instructions should not be needed for normal synchronization purposes.
On Windows, MemoryBarrier is defined in Winnt.h and gives you a different memory barrier instruction depending on whether you are compiling for x86 or x64. The memory barrier instruction serves as a full barrier, preventing all reordering of reads and writes across the barrier. Thus, MemoryBarrier on Windows gives a stronger reordering guarantee than it does on Xbox 360.
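Putting the two together, a Windows-only sketch of the flag-and-data pattern might pair _ReadWriteBarrier (for the compiler) with MemoryBarrier (for the CPU); the names Publish/Consume and the globals here are only illustrative:

#include <windows.h>
#include <intrin.h>

void DrawSprite( int x, int y );    // as in the earlier example

struct Payload { int x; int y; };
Payload g_payload;                  // hypothetical shared data
LONG g_payloadReady = 0;            // hypothetical flag guarding it

void Publish( int x, int y )        // write-release: data, then barriers, then flag
{
    g_payload.x = x;
    g_payload.y = y;
    _ReadWriteBarrier();            // keep the compiler from moving the flag write up
    MemoryBarrier();                // keep the CPU from moving it up
    g_payloadReady = 1;
}

void Consume()                      // read-acquire: flag, then barriers, then data
{
    if( g_payloadReady )
    {
        MemoryBarrier();            // keep the CPU from moving the data reads up
        _ReadWriteBarrier();        // keep the compiler from doing so either
        DrawSprite( g_payload.x, g_payload.y );
    }
}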
On Xbox 360, and on many other CPUs, there is one additional way that read-reordering by the CPU can be prevented. If you read a pointer and then use that pointer to load other data, the CPU guarantees that the reads off of the pointer are not older than the read of the pointer. If your lock flag is a pointer and if all reads of shared data are off of the pointer, the MemoryBarrier can be omitted, for a modest performance savings.

Data* localPointer = g_sharedPointer;
if( localPointer )
{
    // No import barrier is needed -- all reads off of localPointer
    // are guaranteed to not be reordered past the read of
    // localPointer.
    int localVariable = localPointer->y;
    // A memory barrier is needed to stop the read of g_global
    // from being speculatively moved ahead of the read of
    // g_sharedPointer.
    int localVariable2 = g_global;
}

The MemoryBarrier instruction only prevents reordering of reads and writes to cacheable memory. If you allocate memory as PAGE_NOCACHE or PAGE_WRITECOMBINE, a common technique for device driver authors and for game developers on Xbox 360, MemoryBarrier has no effect on accesses to this memory. Most developers don’t need synchronization of non-cacheable memory. That is beyond the scope of this article.
Interlocked Functions and CPU Reordering

Sometimes the read or write that acquires or releases a resource is done using one of the InterlockedXxx functions. On Windows, this simplifies things, because the InterlockedXxx functions are all full memory barriers: they effectively have a CPU memory barrier both before and after them, which means that they are a full read-acquire or write-release barrier all by themselves.
On Xbox 360, the InterlockedXxx functions do not contain CPU memory barriers. They prevent compiler reordering of reads and writes but not CPU reordering. Therefore, in most cases when using InterlockedXxx functions on Xbox 360, you should precede or follow them with an lwsync, to make them a read-acquire or write-release barrier. For convenience and for easier readability, there are Acquire and Release versions of many of the InterlockedXxx functions. These come with a built-in memory barrier. For instance, InterlockedIncrementAcquire does an interlocked increment followed by an lwsync memory barrier to give the full read-acquire functionality.
It is recommended that you use the Acquire and Release versions of the InterlockedXxx functions (most of which are available on Windows as well, with no performance penalty) to make your intent more obvious and to make it easier to get the memory barrier instructions in the correct place. Any use of InterlockedXxx on Xbox 360 without a memory barrier should be examined very carefully, because it is often a bug.
This sample demonstrates how one thread can pass tasks or other data to another thread using the Acquire and Release versions of the InterlockedXxxSList functions. The InterlockedXxxSList functions are a family of functions for maintaining a shared singly linked list without a lock. Note that Acquire and Release variants of these functions are not available on Windows, but the regular versions of these functions are a full memory barrier on Windows.

// Declarations for the Task class go here.

// Add a new task to the list using lockless programming.
void AddTask( DWORD ID, DWORD data )
{
    Task* newItem = new Task( ID, data );
    InterlockedPushEntrySListRelease( g_taskList, newItem );
}

// Remove a task from the list, using lockless programming.
// This will return NULL if there are no items in the list.
Task* GetTask()
{
    Task* result = (Task*)
        InterlockedPopEntrySListAcquire( g_taskList );
    return result;
}

Volatile Variables and Reordering

The C++ Standard says that reads of volatile variables cannot be cached, volatile writes cannot be delayed, and volatile reads and writes cannot be moved past each other. This is sufficient for communicating with hardware devices, which is the purpose of the volatile keyword in the C++ Standard.
However, the guarantees of the standard are not sufficient for using volatile for multi-threading. The C++ Standard does not stop the compiler from reordering non-volatile reads and writes relative to volatile reads and writes, and it says nothing about preventing CPU reordering.
Visual C++ 2005 goes beyond standard C++ to define multi-threading-friendly semantics for volatile variable access. Starting with Visual C++ 2005, reads from volatile variables are defined to have read-acquire semantics, and writes to volatile variables are defined to have write-release semantics. This means that the compiler will not rearrange any reads and writes past them, and on Windows it will ensure that the CPU does not do so either.
It is important to understand that these new guarantees only apply to Visual C++ 2005 and future versions of Visual C++. Compilers from other vendors will generally implement different semantics, without the extra guarantees of Visual C++ 2005. Also, on Xbox 360, the compiler does not insert any instructions to prevent the CPU from reordering reads and writes.
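So, on Windows with Visual C++ 2005 or later, the flag in the earlier publish/consume pattern can simply be declared volatile and the explicit barriers dropped; a minimal sketch with illustrative names (and, as noted above, not sufficient on Xbox 360):

struct Payload { int x; int y; };
Payload g_payload;                       // hypothetical shared data
volatile bool g_payloadReady = false;    // hypothetical flag; volatile gives acquire/release here

void Publish( int x, int y )
{
    g_payload.x = x;
    g_payload.y = y;
    g_payloadReady = true;    // volatile write = write-release: the writes above
                              // cannot be moved after this store
}

void Consume()
{
    if( g_payloadReady )      // volatile read = read-acquire: the reads below
    {                         // cannot be moved before this load
        int localX = g_payload.x;
        int localY = g_payload.y;
        (void)localX; (void)localY;
    }
}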
Example of a Lock-Free Data Pipe

A pipe is a construct that lets one or more threads write data that is then read by other threads. A lockless version of a pipe can be an elegant and efficient way to pass work from thread to thread. The DirectX SDK supplies LockFreePipe, a single-reader, single-writer lockless pipe that is available in DXUTLockFreePipe.h. The same LockFreePipe is available in the Xbox 360 SDK in AtgLockFreePipe.h.
LockFreePipe can be used when two threads have a producer/consumer relationship. The producer thread can write data to the pipe for the consumer thread to process later, without ever blocking. If the pipe fills up, writes fail and the producer thread has to try again later, but this only happens if the producer gets too far ahead of the consumer. If the pipe empties, reads fail and the consumer thread has to try again later, but this only happens when there is no work for the consumer thread to do. If the two threads are well-balanced, and the pipe is big enough, the pipe lets them smoothly pass data along with no delays or blocks.
Xbox 360 Performance

The performance of synchronization instructions and functions on Xbox 360 will vary depending on what other code is running. Acquiring locks will take much longer if another thread currently owns the lock. InterlockedIncrement and critical section operations will take much longer if other threads are writing to the same cache line. The contents of the store queues can also affect performance. Therefore, all of these numbers are just approximations, generated from very simple tests:
lwsync was measured as taking 33-48 cycles.
InterlockedIncrement was measured as taking 225-260 cycles.
Acquiring or releasing a critical section was measured as taking about 345 cycles.
Acquiring or releasing a mutex was measured as taking about 2350 cycles.
Windows Performance

The performance of synchronization instructions and functions on Windows varies widely depending on the processor type and configuration and on what other code is running. Multi-core and multi-socket systems often take longer to execute synchronizing instructions, and acquiring locks takes much longer if another thread currently owns the lock.
However, even some measurements generated from very simple tests are helpful:
MemoryBarrier was measured as taking 20-90 cycles.
InterlockedIncrement was measured as taking 36-90 cycles.
Acquiring or releasing a critical section was measured as taking 40-100 cycles.
Acquiring or releasing a mutex was measured as taking about 750-2500 cycles.
These tests were done on Windows XP on a range of different processors. The short times were on a single-processor machine, and the longer times were on a multi-processor machine.
While acquiring and releasing locks is more expensive than using lockless programming, it is even better to share data less frequently, thus avoiding the cost altogether.
Performance Thoughts

Acquiring or releasing a critical section consists of a memory barrier, an InterlockedXxx operation, and some extra checking to handle recursion and to fall back to a mutex, if necessary. You should be wary of implementing your own critical section, because spinning in a loop waiting for a lock to be free, without falling back to a mutex, can waste considerable performance. For critical sections that are heavily contended but not held for long, you should consider using InitializeCriticalSectionAndSpinCount so that the operating system will spin for a while waiting for the critical section to be available rather than immediately deferring to a mutex if the critical section is owned when you try to acquire it. In order to identify critical sections that can benefit from a spin count, it is necessary to measure the length of the typical wait for a particular lock.
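As a sketch, setting up such a critical section looks like this; the 4000 spin count is only illustrative and the right value should come from measuring the typical wait:

#include <windows.h>

CRITICAL_SECTION g_hotLock;     // hypothetical lock: heavily contended, held briefly

void InitHotLock()
{
    // Spin up to 4000 iterations waiting for the owner to release the lock
    // before falling back to a kernel wait.
    InitializeCriticalSectionAndSpinCount( &g_hotLock, 4000 );
}

void TouchHotData()
{
    EnterCriticalSection( &g_hotLock );
    // ... short piece of work on the shared state ...
    LeaveCriticalSection( &g_hotLock );
}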
If a shared heap is used for memory allocations—the default behavior—every memory allocation and free involves acquiring a lock. As the number of threads and the number of allocations increases, performance levels off, and eventually starts to decrease. Using per-thread heaps, or reducing the number of allocations, can avoid this locking bottleneck.
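A per-thread heap can be as simple as giving each thread its own heap from HeapCreate; a sketch (the thread-local handle and helper names are hypothetical):

#include <windows.h>

__declspec(thread) HANDLE t_heap = NULL;   // hypothetical per-thread heap handle

void* ThreadLocalAlloc( SIZE_T bytes )
{
    if( t_heap == NULL )
    {
        // HEAP_NO_SERIALIZE skips the heap's internal lock; safe because
        // this heap is only ever touched by the owning thread.
        t_heap = HeapCreate( HEAP_NO_SERIALIZE, 0, 0 );
    }
    return HeapAlloc( t_heap, 0, bytes );
}

void ThreadLocalFree( void* p )
{
    // Must be called on the thread that allocated p.
    HeapFree( t_heap, 0, p );
}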
If one thread is generating data and another thread is consuming data, they may end up sharing data frequently. This can happen if one thread is loading resources and another thread is rendering the scene. If the rendering thread references the shared data on every draw call, the locking overhead will be high. Much better performance can be realized if each thread has private data structures which are then synchronized once per frame or less.
Lockless algorithms are not guaranteed to be faster than algorithms that use locks. You should check to see if locks are actually causing you problems before trying to avoid them, and you should measure to see if your lockless code actually improves performance.
Platform Differences Summary

InterlockedXxx functions prevent CPU read/write reordering on Windows, but not on Xbox 360.
Reading and writing of volatile variables using Visual Studio C++ 2005 prevents CPU read/write reordering on Windows, but on Xbox 360, it only prevents compiler read/write reordering.
Writes are reordered on Xbox 360, but not on x86 or x64.
Reads are reordered on Xbox 360, but on x86 or x64 they are only reordered relative to writes, and only if the reads and writes target different locations.
Recommendations

Use locks when possible because they are easier to use correctly.
Avoid locking too frequently, so that locking costs do not become significant.
Avoid holding locks for too long, in order to avoid long stalls.
Use lockless programming when appropriate, but be sure that the gains justify the complexity.
Use lockless programming or spin locks in situations where other locks are prohibited, such as when sharing data between deferred procedure calls and normal code.
Only use standard lockless programming algorithms that have been proven to be correct.
When doing lockless programming, be sure to use volatile flag variables and memory barrier instructions as needed.
When using InterlockedXxx on Xbox 360, use the Acquire and Release variants.
References

MSDN Library. “volatile (C++).” C++ Language Reference.
Vance Morrison. “Understand the Impact of Low-Lock Techniques in Multithreaded Apps.” MSDN Magazine, October 2005.
Lyons, Michael. “PowerPC Storage Model and AIX Programming.” IBM developerWorks, 16 Nov 2005.
McKenney, Paul E. “Memory Ordering in Modern Microprocessors, Part II.” Linux Journal, September 2005. [This article has some x86 details.]
Intel Corporation. “Intel® 64 Architecture Memory Ordering.” August 2007. [Applies to both IA-32 and Intel 64 processors.]
Niebler, Eric. “Trip Report: Ad-Hoc Meeting on Threads in C++.” The C++ Source, 17 Oct 2006.
Hart, Thomas E. 2006. “Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation.” Proceedings of the 2006 International Parallel and Distributed Processing Symposium (IPDPS 2006), Rhodes Island, Greece, April 2006.

Learn And Life.

Lock-Free Programming

Posted on 2017-04-04

Lock-Free Data Structures
By Andrei Alexandrescu, October 01, 2004

Lock-free data structures guarantee the progress of at least one thread when executing multithreaded procedures, thereby helping you avoid deadlock.
Andrei Alexandrescu is a graduate student in Computer Science at the University of Washington and author of Modern C++ Design. He can be contacted at andrei@metalanguage.com.

After skipping an installment of "Generic" (it's naive, I know, to think that grad school asks for anything less than 100 percent of one's time), there has been an embarrassment of riches as far as topic candidates for this article. One topic candidate was a discussion of constructors; in particular, forwarding constructors, handling exceptions, and two-stage object construction. One other topic candidate—and another glimpse into the Yaslander technology [2]—was creating containers (such as lists, vectors, or maps) of incomplete types, something that is possible with the help of an interesting set of tricks, but not guaranteed by the standard containers.

While both candidates are intriguing, they couldn't stand a chance against lock-free data structures, which are all the rage in the multithreaded programming community. At this year's Programming Language Design and Implementation conference (http://www.cs.umd.edu/~pugh/pldi04/), Michael Maged presented the world's first lock-free memory allocator [7], which outperforms other, more complex, carefully designed lock-based allocators in many tests.

This is the most recent of many lock-free data structures and algorithms that have appeared in the recent past.

What Do You Mean, “lock-free?”
That’s exactly what I would have asked only a while ago. As the bona-fide mainstream multithreaded programmer that I was, lock-based multithreaded algorithms were familiar to me. In classic lock-based programming, whenever you need to share some data, you need to serialize access to it. The operations that change data must appear as atomic, such that no other thread intervenes to spoil your data’s invariant. Even a simple operation such as ++count (where count is an integral type) must be locked. Incrementing is really a three-step (read, modify, write) operation that isn’t necessarily atomic.

In short, with lock-based multithreaded programming, you need to make sure that any operation on shared data that is susceptible to race conditions is made atomic by locking and unlocking a mutex. On the bright side, as long as the mutex is locked, you can perform just about any operation, in confidence that no other thread will trample on your shared state.

It is exactly this "arbitrary"-ness of what you can do while a mutex is locked that's also problematic. You could, for example, read the keyboard or perform some slow I/O operation, which means that you delay any other threads waiting for the same mutex. Worse, you could decide you want access to some other piece of shared data and attempt to lock its mutex. If another thread has already locked that last mutex and wants access to the first mutex that your thread already holds, both threads hang faster than you can say "deadlock."

Enter lock-free programming. In lock-free programming, you can’t do just about anything atomically. There is only a precious small set of things that you can do atomically, a limitation that makes lock-free programming way harder. (In fact, there must be around half a dozen lock-free programming experts around the world, and yours truly is not among them. With luck, however, this article will provide you with the basic tools, references, and enthusiasm to help you become one.) The reward of such a scarce framework is that you can provide much better guarantees about thread progress and the interaction between threads. But what’s that “small set of things” that you can do atomically in lock-free programming? In fact, what would be the minimal set of atomic primitives that would allow implementing any lock-free algorithm—if there’s such a set?

If you believe that’s a fundamental enough question to award a prize to the answerer, so did others. In 2003, Maurice Herlihy was awarded the Edsger W. Dijkstra Prize in Distributed Computing for his seminal 1991 paper “Wait-Free Synchronization” (see http://www.podc.org/dijkstra/2003.html, which includes a link to the paper, too). In his tour-de-force paper, Herlihy proves which primitives are good and which are bad for building lock-free data structures. That brought some seemingly hot hardware architectures to instant obsolescence, while clarifying what synchronization primitives should be implemented in future hardware.

For example, Herlihy's paper gave impossibility results, showing that atomic operations such as test-and-set, swap, fetch-and-add, or even atomic queues (!) are insufficient for properly synchronizing more than two threads. (That's surprising because queues with atomic push and pop operations would seem to provide quite a powerful abstraction.) On the bright side, Herlihy also gave universality results, proving that some simple constructs are enough for implementing any lock-free algorithm for any number of threads.

The simplest and most popular universal primitive, and the one that I use throughout, is the compare-and-swap (CAS) operation:

template <class T>
bool CAS(T* addr, T expected, T value) {
    if (*addr == expected) {
        *addr = value;
        return true;
    }
    return false;
}

CAS compares the content of a memory address with an expected value, and if the comparison succeeds, replaces the content with a new value. The entire procedure is atomic. Many modern processors implement CAS or equivalent primitives for different bit lengths (the reason for which we've made it a template, assuming an implementation uses metaprogramming to restrict possible Ts). As a rule of thumb, the more bits a CAS can compare-and-swap atomically, the easier it is to implement lock-free data structures with it. Most of today's 32-bit processors implement 64-bit CAS; for example, Intel's assembler calls it CMPXCHG8B (you gotta love those assembler mnemonics).
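(The article predates C++11; on a current compiler, the same primitive is what std::atomic exposes as compare_exchange_strong. A rough sketch of the CAS above in those terms, not part of the original column:)

#include <atomic>

template <class T>
bool CAS(std::atomic<T>* addr, T expected, T value) {
    // compare_exchange_strong performs the compare-and-replace atomically;
    // on failure it writes the observed value back into 'expected', which
    // this sketch simply discards in order to keep the original signature.
    return addr->compare_exchange_strong(expected, value);
}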

A Word of Caution
Usually a C++ article is accompanied by C++ code snippets and examples. Ideally, that code is Standard C++, and "Generic" strives to live up to that ideal.

When writing about multithreaded code, however, giving Standard C++ code samples is simply impossible. Threads don’t exist in Standard C++, and you can’t code something that doesn’t exist. Therefore, the code for this article is “pseudocode” and not meant as Standard C++ code for portable compilation. Take memory barriers, for example. Real code would need to be either assembly language translations of the algorithms described herein, or at least sprinkle C++ code with some so-called “memory barriers”—processor-dependent magic that forces proper ordering of memory reads and writes. I don’t want to spread the discussion too thin by explaining memory barriers in addition to lock-free data structures. If you are interested, refer to Butenhof’s excellent book [3] or to a short introduction [6]. For purposes here, I assume that the compiler and the hardware don’t introduce funky optimizations (such as eliminating some “redundant” variable reads, a valid optimization under a single-thread assumption). Technically, that’s called a “sequentially consistent” model in which reads and writes are performed and seen in the exact order in which the source code does them [8].

Wait-Free and Lock-Free versus Locked
A “wait-free” procedure can complete in a finite number of steps, regardless of the relative speeds of other threads.

A “lock-free” procedure guarantees progress of at least one of the threads executing the procedure. That means some threads can be delayed arbitrarily, but it is guaranteed that at least one thread makes progress at each step. So the system as a whole always makes progress, although some threads might make slower progress than others. Lock-based programs can’t provide any of the aforementioned guarantees. If any thread is delayed while holding a lock to a mutex, progress cannot be made by threads that wait for the same mutex; and in the general case, locked algorithms are prey to deadlock—each waits for a mutex locked by the other—and livelock—each tries to dodge the other’s locking behavior, just like two dudes in the hallway trying to go past one another but end up doing that social dance of swinging left and right in synchronicity. We humans are pretty good at ending that with a laugh; processors, however, often enjoy doing it until rebooting sets them apart.

Wait-free and lock-free algorithms enjoy more advantages derived from their definitions:

Thread-killing Immunity: Any thread forcefully killed in the system won’t delay other threads.
Signal Immunity: The C and C++ Standards prohibit signals or asynchronous interrupts from calling many system routines such as malloc. If the interrupt calls malloc at the same time as the interrupted thread, that could cause deadlock. With lock-free routines, there's no such problem anymore: Threads can freely interleave execution.
Priority Inversion Immunity: Priority inversion occurs when a low-priority thread holds a lock to a mutex needed by a high-priority thread. Such tricky conflicts must be resolved by the OS kernel. Wait-free and lock-free algorithms are immune to such problems.
A Lock-Free WRRM Map
Column writing offers the perk of defining acronyms, so let’s define WRRM (Write Rarely Read Many) maps as maps that are read a lot more than they are mutated. Examples include object factories [1], many instances of the Observer design pattern [5], mappings of currency names to exchange rates that are looked up many, many times but are updated only by a comparatively slow stream, and various other look-up tables.

WRRM maps can be implemented via std::map or the post-standard unordered_map (http://www.open-std.org/jtcl/sc22/wg21/docs/papers/2004/n1647.pdf), but as I argue in Modern C++ Design, assoc_vector (a sorted vector of pairs) is a good candidate for WRRM maps because it trades update speed for lookup speed. Whatever structure is used, our lock-free aspect is orthogonal to it; let's just call the back-end Map. Also, for the purposes of this article, iteration is irrelevant—maps are only tables that provide a means to look up a key or update a key-value pair.

To recap how a locking implementation would look, let’s combine a Map object with a Mutex object like so:

// A locking implementation of WRRMMap
template <class K, class V>
class WRRMMap {
    Mutex mtx_;
    Map<K, V> map_;
public:
    V Lookup(const K& k) {
        Lock lock(mtx_);
        return map_[k];
    }
    void Update(const K& k,
            const V& v) {
        Lock lock(mtx_);
        map_[k] = v;
    }
};

To avoid ownership issues and dangling references (that could bite us harder later), Lookup returns its result by value. Rock-solid—but at a cost. Every lookup locks/unlocks the Mutex, although (1) parallel lookups don’t need to interlock, and (2) by the spec, Update is much less often called than Lookup. Ouch! Let’s now try to provide a better WRRMMap implementation.

Garbage Collector, Where Art Thou?
The first shot at implementing a lock-free WRRMMap rests on this idea:

Reads have no locking at all.
Updates make a copy of the entire map, update the copy, and then try to CAS it with the old map. While the CAS operation does not succeed, the copy/update/CAS process is tried again in a loop.
Because CAS is limited in how many bytes it can swap, WRRMMap stores the Map as a pointer and not as a direct member of WRRMMap.
// 1st lock-free implementation of WRRMMap
// Works only if you have GC
template <class K, class V>
class WRRMMap {
    Map<K, V>* pMap_;
public:
    V Lookup(const K& k) {
        // Look, ma, no lock
        return (*pMap_)[k];
    }
    void Update(const K& k,
            const V& v) {
        Map<K, V>* pNew = 0;
        Map<K, V>* pOld;
        do {
            pOld = pMap_;
            delete pNew;
            pNew = new Map<K, V>(*pOld);
            (*pNew)[k] = v;
        } while (!CAS(&pMap_, pOld, pNew));
        // DON'T delete pMap_;
    }
};

It works! In a loop, the Update routine makes a full-blown copy of the map, adds the new entry to it, and then attempts to swap the pointers. It is important to do CAS and not a simple assignment; otherwise, the following sequence of events could corrupt the map:

Thread A copies the map.
Thread B copies the map as well and adds an entry.
Thread A adds some other entry.
Thread A replaces the map with its version of the map—a version that does not contain whatever B added.
With CAS, things work pretty neatly because each thread says something like, “assuming the map hasn’t changed since I last looked at it, copy it. Otherwise, start all over again.”

This makes Update lock-free but not wait-free, by my definitions. If many threads call Update concurrently, any particular thread might loop indefinitely, but at all times some thread will be guaranteed to update the structure successfully, thus global progress is being made at each step. Luckily, Lookup is wait-free.

In a garbage-collected environment, we'd be done, and this article would end on an upbeat note. Without garbage collection, however, there is much pain to come. This is because you cannot simply dispose of the old pMap_ willy-nilly; what if, just as you are trying to delete it, many other threads are frantically looking for things inside pMap_ via the Lookup function? You see, a garbage collector would have access to all threads' data and private stacks; it would have a good perspective on when the unused pMap_ pointers aren't perused anymore, and would nicely scavenge them. Without a garbage collector, things get harder. Much harder, actually, and it turns out that deterministic memory freeing is quite a fundamental problem in lock-free data structures.

Write-Locked WRRM Maps
To understand the viciousness of our adversary, it is instructive to try a classic reference-counting implementation and see where it fails. So, think of associating a reference count with the pointer to map, and have WRRMMap store a pointer to the thusly formed structure:

template <class K, class V>
class WRRMMap {
    typedef std::pair<Map<K, V>*,
        unsigned> Data;
    Data* pData_;
    …
};

Sweet. Now, Lookup increments pData_->second, searches through the map all it wants, then decrements pData_->second. When the reference count hits zero, pData_->first can be deleted, and then so can pData_ itself. Sounds foolproof, except…except it's "foolish" (or whatever the antonym to "foolproof" is). Imagine that right at the time some thread notices the refcount is zero and proceeds on deleting pData_, another thread…no, better: A bazillion threads have just loaded the moribund pData_ and are about to read through it! No matter how smart a scheme is, it hits this fundamental Catch-22—to read the pointer to the data, you need to increment a reference count; but the counter must be part of the data itself, so it can't be read without accessing the pointer first. It's like an electric fence that has the turn-off button up on top of it: To safely climb the fence you need to disable it first, but to disable it you need to climb it.

So let's think of other ways to delete the old map properly. One solution would be to wait, then delete. You see, the old pMap_ objects will be looked up by fewer and fewer threads as processor eons (milliseconds) go by; this is because new lookups use the new maps; as soon as the lookups that were active just before the CAS finish, the pMap_ is ready to go to Hades. Therefore, a solution would be to queue up old pMap_ values to some "boa serpent" thread that, in a loop, sleeps for, say, 200 milliseconds, wakes up and deletes the least recent map, and then goes back to sleep for digestion.

This is not a theoretically safe solution (although it practically could well be within bounds). One nasty thing is that if, for whatever reason, a lookup thread is delayed, the boa serpent thread can delete the map under that thread’s feet. This could be solved by always assigning the boa serpent thread a lower priority than any other’s, but as a whole, the solution has a stench that is hard to remove. If you agree with me that it’s hard to defend this technique with a straight face, let’s move on.

Other solutions [4] rely on an extended DCAS atomic instruction, which is able to compare-and-swap two noncontiguous words in memory:

template <class T1, class T2>
bool DCAS(T1* p1, T2* p2,
        T1 e1, T2 e2,
        T1 v1, T2 v2) {
    if (*p1 == e1 && *p2 == e2) {
        *p1 = v1; *p2 = v2;
        return true;
    }
    return false;
}

Naturally, the two locations would be the pointer and the reference count itself. DCAS has been implemented (very inefficiently) by the Motorola 68040 processors, but not by other processors. Because of that, DCAS-based solutions are considered of primarily theoretical value.

The first shot at a solution with deterministic destruction is to rely on the less-demanding CAS2. Again, many 32-bit machines implement a 64-bit CAS, often dubbed as CAS2. (Because it only operates on contiguous words, CAS2 is obviously less powerful than DCAS.) For starters, let’s store the reference count next to the pointer that it guards:

template <class K, class V>
class WRRMMap {
    typedef std::pair<Map<K, V>*,
        unsigned> Data;
    Data data_;
    …
};

(Notice that this time the count sits next to the pointer that it protects, a setup that eliminates the Catch-22 problem mentioned earlier. You’ll see the cost of this setup in a minute.)

Then, let’s modify Lookup to increment the reference count before accessing the map, and decrement it after. In the following code snippets, I ignore exception safety issues (which can be taken care of with standard techniques) for the sake of brevity.

V Lookup(const K& k) {
    Data old;
    Data fresh;
    do {
        old = data_;
        fresh = old;
        ++fresh.second;
    } while (!CAS(&data_, old, fresh));
    V temp = (*fresh.first)[k];
    do {
        old = data_;
        fresh = old;
        --fresh.second;
    } while (!CAS(&data_, old, fresh));
    return temp;
}

Finally, Update replaces the map with a new one—but only in the window of opportunity when the reference count is 1.

void Update(const K& k,
        const V& v) {
    Data old;
    Data fresh;
    old.second = 1;
    fresh.first = 0;
    fresh.second = 1;
    Map<K, V>* last = 0;
    do {
        old.first = data_.first;
        if (last != old.first) {
            delete fresh.first;
            fresh.first = new Map<K, V>(*old.first);
            fresh.first->insert(make_pair(k, v));
            last = old.first;
        }
    } while (!CAS(&data_, old, fresh));
    delete old.first; // whew
}

Here's how Update works. It defines the now-familiar old and fresh variables. But this time old.second (the count) is never assigned from data_.second; it is always 1. This means that Update loops until it has a window of opportunity to replace a pointer with a counter of 1, with another pointer having a counter of 1. In plain English, the loop says "I'll replace the old map with a new, updated one, and I'll be on the lookout for any other updates of the map, but I'll only do the replacement when the reference count of the existing map is one." The variable last and its associated code are just an optimization: they avoid rebuilding the map over and over again if only the count has changed and the old map has not been replaced.

Neat, huh? Not that much. Update is now locked: It needs to wait for all Lookups to finish before it has a chance to update the map. Gone with the wind are all the nice properties of lock-free data structures. In particular, it is easy to starve Update to death: Just look up the map at a high-enough rate—and the reference count never goes down to 1. So what you really have so far is not a WRRM (Write-Rarely-Read-Many) map, but a WRRMBNTM (Write-Rarely-Read-Many-But-Not-Too-Many) one instead.

Conclusion
Lock-free data structures are promising. They exhibit good properties with regard to thread killing, priority inversion, and signal safety. They never deadlock or livelock. In tests, recent lock-free data structures surpass their locked counterparts by a large margin [9]. However, lock-free programming is tricky, especially with regard to memory deallocation. A garbage-collected environment is a plus because it has the means to stop and inspect all threads, but if you want deterministic destruction, you need special support from the hardware or the memory allocator. In the next installment of "Generic," I'll look into ways to optimize WRRMMap such that it stays lock-free while performing deterministic destruction.

And if this installment’s garbage-collected map and WRRMBNTM map dissatisfied you, here’s a money saving tip: Don’t go watch the movie Alien vs. Predator, unless you like “so bad it’s funny” movies.

Acknowledgments
David B. Held, Michael Maged, Larry Evans, and Scott Meyers provided very helpful feedback. Also, Michael graciously agreed to coauthor the next "Generic" installment, which will greatly improve on our WRRMap implementation.

References
[1] Alexandrescu, Andrei. Modern C++ Design, Addison-Wesley Longman, 2001.

[2] Alexandrescu, Andrei. “Generic:yasli::vector Is On the Move,” C/C++ Users Journal, June 2004.

[3] Butenhof, D.R. Programming with POSIX Threads, Addison-Wesley, 1997.

[4] Detlefs, David L., Paul A. Martin, Mark Moir, and Guy L. Steele, Jr. “Lock-free Reference Counting,” Proceedings of the Twentieth Annual ACM Symposium on Principles of Distributed Computing, pages 190-199, ACM Press, 2001. ISBN 1-58113-383-9.

[5] Gamma, Erich, Richard Helm, Ralph E. Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995.

[6] Meyers, Scott and Andrei Alexandrescu. “The Perils of Double-Checked Locking.” Dr. Dobb’s Journal, July 2004.

[7] Maged, Michael M. “Scalable Lock-free Dynamic Memory Allocation,” Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 35-46. ACM Press, 2004. ISBN 1-58113-807-5.

[8] Robison, Arch. “Memory Consistency & .NET,” Dr. Dobb’s Journal, April 2003.

[9] Maged, Michael M. “CAS-Based Lock-Free Algorithm for Shared Deques,” The Ninth Euro-Par Conference on Parallel Processing, LNCS volume 2790, pages 651-660, August 2003.

Learn And Life.

A Qingming Festival Memorial

Posted on 2017-04-04

My father is gone. He left so quickly, just three days after I went back to Beijing the year before last; I never even had the chance to take care of him before he was gone. I still remember that when I was leaving, my father was in my second brother's new house, wiping away the grime left over from the renovation with a towel. He smiled at me and said, "You're off, then," and I said, "Mm." We never exchanged many words. Then I got on the bus and left without looking back, watching the village recede further and further until it slowly disappeared from my sight.

Every National Day holiday I would go back. My parents farmed a few mu of land, and National Day falls right in the busy harvest season, so I would go home to help them bring in the rice. I am not very strong, but I helped however I could. The family was not well off and had no vehicle, so my parents could only haul the cart by brute force to do the farm work. I kept meaning to buy my father a three-wheeled cart, but in the end I never did, because I did not want them to keep toiling like that. They had worked hard their whole lives and worn themselves out with worry for us children. My father's hair had turned white, and sometimes I could not bear to look him in the eye. I think back to when I was small: my father bent over the desk keeping the account books, and later, as his eyesight failed, bent over the desk keeping the books in his reading glasses. Back then my father was, in my heart, a scholar, but in the end a frail scholar who could only work himself to the bone with what little strength he had. I did not want him to work so hard, yet he still did, even on the day he left us.

I often feel that my father is still alive, that he never left. Only sometimes, when I miss him, I pick up the phone, see his number, and remember that he is gone, really gone, and I break down and cry. I never got to look at him properly; I never got to take good care of him; I never got to bring him to Beijing to look around; I never got to stay with him in his old age, to show him the world and the places he had and had not been. In my father's heart there was only me; as long as I could be happy, that was enough for him. Perhaps he never thought about where he himself might go, and perhaps he never thought about what happiness looked like in his son's heart. My father never spent a cent of my money; he was still saving money for me, for this family. He had three sons, and I told him the three of us would take care of him, yet he still toiled day and night, keeping the early-morning shift. After dinner he would ride his motorbike to his workplace, sometimes sleeping and eating there, and eating very plainly. I heard from my mother that in the morning he would just boil a bowl of plain noodle soup and go to work. What kind of life was that! Eating at work, never buying anything better, just making do. Sometimes when I think of my father I can hardly swallow my own food. He never got to eat well, never got to see the world; that this was his life pains me so much as his son. The past is too hard to look back on, and I do not know how hard it was for them to get through it. They gave everything they had, gave all of their love to their sons and grandchildren, and sacrificed themselves. How sad, and how foolish! Dad, I truly did not want you to do that. This family was never really that poor; it is just that you loved your sons, your daughters-in-law, and your grandchildren too much.

I am not yet married, which was what my parents worried about most. I kept putting it off, not because I did not want to, but because housing is too expensive and I do not have that much in savings. Buying an apartment on my own after five years of work: where would that much money come from? I think of how my father quit smoking for my sake. I was still in school then and the tuition was expensive; it was paid by my father, my uncle, and my older brothers. My father did not earn much, yet he worked so hard for me. At the time I did not feel for him as much as I should have, but looking back now, he sacrificed far too much for me. Dad, I truly wish you were still alive, alive and healthy. I would not want you to go out and earn money any more; for you, this family was never a problem money could solve. I only want this family to be happy and joyful, with enough to spend; I would not trade your life for that bit of money. It is not worth it. Dad, you are below the earth and your son is above it, separated between two worlds. I hope you can be happy over there and never again wear yourself out with worry for this family. While you were alive you never spent a cent of my money; your unfilial son can only burn some paper money for you and hope it reaches you. Dad, sometimes I think about when I could sit and eat and drink with you, talking and laughing; what a happy thing that would be, simple and blissful. But it can never happen again.

Learn And Life.

RD or TDD ...

Posted on 2017-04-04

I recently moved from working on the video origin service to front-end business requirements, and I have been getting familiar with the front-end code and its basic architecture. Since there have been few requirements lately, I have had almost nothing to do since I arrived, which raises a question: am I really supposed to just keep waiting?

I started wondering: should I spend the time doing code review properly? Optimize the overall architecture? Refactor the bot module? Or step back, rethink the current architecture and business design from the perspective of the whole product, and then re-organize the existing business logic? But the services in production are running fine right now, so is any of that really necessary?
So the question arose: once the RD (development) work is done, should TDD come next?

Even without a standardized development process and approach, TDD is one of the practices in agile development that fits reasonably well. Its basic goals are:
1. Make sure every requirement is covered.
2. As the code keeps growing and being refactored, check that all functionality still works correctly.
So I do think it is worth doing, and it would also make everyone more familiar with their own part of the business. However, TDD has a few problems that are hard to deal with:
1. Deciding the scope of testing.
2. Focusing on tests rather than on design.
3. TDD leads to a large number of mocks and stubs.
4. Test cases are not as simple as they seem.

Dedicated to everyone who keeps working hard; I hope you will share your thoughts with me.

Learn And Life.

A Thing or Two About Iterators

Posted on 2017-01-06

PHP ships with many iterators, but they tend to be used rarely in day-to-day programming: on the one hand the iteration facilities are limited and only work on a limited set of objects, and on the other hand object-oriented thinking is often lacking. Let's take a look at some of the iterators PHP provides.

The Iterator interface

interface Iterator {
    function rewind();
    function current();
    function key();
    function next();
    function valid();
}

The iterator classes available in PHP

ArrayIterator implements ArrayAccess , SeekableIterator , Countable , Serializable
AppendIterator extends IteratorIterator implements OuterIterator
CachingIterator extends IteratorIterator implements OuterIterator , ArrayAccess , Countable
CallbackFilterIterator extends FilterIterator implements OuterIterator
DirectoryIterator extends SplFileInfo implements SeekableIterator
EmptyIterator implements Iterator
FilesystemIterator extends DirectoryIterator implements SeekableIterator
abstract FilterIterator extends IteratorIterator implements OuterIterator
GlobIterator extends FilesystemIterator implements SeekableIterator , Countable
InfiniteIterator extends IteratorIterator implements OuterIterator
IteratorIterator implements OuterIterator
LimitIterator extends IteratorIterator implements OuterIterator
MultipleIterator implements Iterator
NoRewindIterator extends IteratorIterator
ParentIterator extends RecursiveFilterIterator implements RecursiveIterator , OuterIterator
RecursiveArrayIterator extends ArrayIterator implements RecursiveIterator
RecursiveCachingIterator extends CachingIterator implements Countable , ArrayAccess , OuterIterator , RecursiveIterator
RecursiveCallbackFilterIterator extends CallbackFilterIterator implements OuterIterator , RecursiveIterator
RecursiveDirectoryIterator extends FilesystemIterator implements SeekableIterator , RecursiveIterator
abstract RecursiveFilterIterator extends FilterIterator implements OuterIterator , RecursiveIterator
RecursiveIteratorIterator implements OuterIterator
RecursiveRegexIterator extends RegexIterator implements RecursiveIterator
RecursiveTreeIterator extends RecursiveIteratorIterator implements OuterIterator
RegexIterator extends FilterIterator

A usage example: iterate over all files with the given extensions under a directory

class RecursiveFileFilterIterator extends FilterIterator{
    protected $ext = array('jpg','png','php');
    public function __construct($path){
        parent::__construct(new RecursiveIteratorIterator(new RecursiveDirectoryIterator($path)));
    }
    public function accept(){
        $item = $this->getInnerIterator();
        if($item->isFile() && in_array(pathinfo($item->getFilename(),PATHINFO_EXTENSION), $this->ext)){
            return true;
        }
    }
}
foreach(new RecursiveFileFilterIterator('/Users/kivmi/Documents') as $item){
    echo $item.PHP_EOL;
}

The iterator design pattern

1. Create the iterator

namespace platform;
class PlatformIteratorBlock implements \Iterator
{
    protected $current = 0;
    protected $platformList = null;
    public function __construct(PlatformBlocks $platformList)
    {
        $this->platformList = $platformList;
        $this->current = $this->platformList->count() - 1;
    }
    public function current()
    {
        return $this->platformList->get($this->current);
    }
    public function key()
    {
        return $this->current;
    }
    public function next()
    {
        $this->current--;
    }
    public function rewind()
    {
        $this->current = $this->platformList->count() - 1;
    }
    public function valid()
    {
        return null != $this->platformList->get($this->current);
    }
}

2. Create a basket to hold the blocks

namespace platform;
class PlatformBlocks implements \Countable
{
    protected $platforms = [];
    public function get($n)
    {
        if(isset($this->platforms[$n]))
        {
            return $this->platforms[$n];
        }
        return null;
    }
    public function add(PlatformBlock $platform)
    {
        $this->platforms[] = $platform;
    }
    public function remove(PlatformBlock $platform)
    {
        foreach ($this->platforms as $k => $block)
        {
            if($block->getName() == $platform->getName())
            {
                unset($this->platforms[$k]);
            }
        }
    }
    public function count()
    {
        return count($this->platforms);
    }
}

3. Consume from the basket

$platformList = new PlatformBlocks();
$platformList->add(new NewsPlatformBlock());
$iterator = new PlatformIteratorBlock($platformList);
while($iterator->valid())
{
    foreach ($iterator->current()->getData() as $params)
    {
        self::doRequest($iterator->current()->getUrl(), $params);
    }
    $iterator->next();
}

Kivmi

Code has a soul, too.

32 posts
4 tags
© 2018 Kivmi
Powered by Hexo
Theme - NexT.Muse