微博看到一篇比较不错的google工程师优化字符串操作的文章

February 16, 2013 | 1 Minute Read

Automated Locality Optimization Based on the

Reuse Distance of String Operations

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/40679.pdf





Abstract—String operations such as memcpy, memset and

memcmp account for a nontrivial amount of Google datacenter

resources. String operations hurt processor cache ef?ciency when

the data accessed is not reused shortly thereafter. Such cache

pollution can be avoided by using nontemporal memory access

to bypass L2/L3 caches. As reuse distance varies greatly across

different memcpy static call contexts in the same program, an

ef?cient solution needs to be call context sensitive. We propose a

novel solution to this problem using the page protection mechanism to measure reuse distance and the GCC feedback directed

optimization mechanism to generate nontemporal memory access

instructions at the appropriate static code contexts. First, the

compiler inserts instrumentation for calls to string operations.

Then a run time library measures reuse distance using the page

protection mechanism during a representative pro?ling run. The

compiler ?nally generates calls to specialized string operations

that use nontemporal operations for the arguments with large

reuse distance. We present a full implementation and initial

results including speedup on large datacenter applications.

Index Terms—memcpy; nontemporal; reuse distance



1.先是用mprotect和 page fault handler记录下 memcpy的操作信息。(上次看到有人用这个办法来调试c++的内存的越界访问错误)



2. 然后根据记录下来的测试信息,用gcc的 Feedback-Directed Optimizations特性生成优化代码。

参考:

Feedback-Directed Optimizations in GCC with Estimated Edge Pro?les

from Hardware Event Sampling

http://bbs.chinaunix.net/thread-1928590-1-1.html

http://ols.fedoraproject.org/GCC/Reprints-2008/ramasamy-reprint.pdf



gcc应该是可以根据profile的信息来生成优化代码?



3. 主要是根据memcpy的内存访问的距离,看是不是采用不采用 nontemporal memory access策略。

论文的一开始对这个做了详细的介绍可以看一下。大概是说不同的cache策略可能对字符串操作性能影响很大,memcpy导致flush了cpu的全部内存cache,而这个cache区域有没有马上用到的话,以后用到的话,cpu中间的cache加载的重复刷写影响性能。

这个优化都做到cpu内存的cache的更新策略上去了!Google的优化也太深入了吧。

主要是prefetchnta 和movntq 汇编指令,prefetchnta进行数据预先读取, movntq进行WC(Write-combining)模式 操作。

微博有人给了优化过的memcpy一个参考 

内存拷贝的优化方法(草稿) [2]   http://www.cnblogs.com/flier/archive/2004/07/08/22352.html