The visible symptom is an ANR: "Input event dispatching timed out sending". The ANR is caused by the mediaplayer process.
First, we used the top command to watch the CPU load, but the load was not high during playback.
Second, we checked /proc/interrupts and found no abnormal interrupts.
This is a customer on-site issue, so all we can do is dig through logs. We added some debug logs and found that the interval between two consecutive log prints is sometimes 11 s, which is weird.
We suspected a scheduling anomaly in the system, so we captured ftrace data.
start ftrace
stop ftrace (stop it immediately once the issue is reproduced)
Note: to match the ANR time printed by logcat against a Unix timestamp, there is a very good tool: http://rimzy.net/tools/php_timestamp_converter.php
Digging through the ftrace data, we found a suspicious point: process X was not scheduled for about 11 s.
We tried renice and taskset to raise the priority of process X and bind it to a specific CPU, but that brought no improvement.
The sched_wakeup for process X comes from the idle process, which means process X is woken up by an interrupt. The latest interrupt is the UART interrupt. Finally, we reviewed the n_tty_write() function.
It writes data in a while loop. When a user-space process uses a blocking write and the write cannot complete, the process yields the CPU and is woken up only when the condition is met. In this ANR case, the root cause is that the user-space process produces a huge burst of log data and blocks in n_tty_write().
Kernel: Linux 3.10
An intermittent crash occurs:
mm/memory.c:399: bad pmd 15141312
Segmentation fault
BUG: Bad rss-counter state mm:ce964380 idx:0 val:5
BUG: Bad rss-counter state mm:ce964380 idx:1 val:1
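We registered a signal handler in the user-space process that hit the segmentation fault, but the handler was never entered when the crash happened. In the kernel's __do_user_fault we dumped the registers when the signal was SIGSEGV and rebuilt the backtrace. The backtrace was different every time, and reading it alongside the assembly code revealed nothing suspicious.
The bad pmd message suggests that the pgd of the process's address space was already corrupted, and that the SIGSEGV came afterwards.
Because the issue shows up in burn-in testing, we used a hardware breakpoint to watch mm->pgd in the process's task_struct.
hw breakpoint config
Following the code in samples/hw_breakpoint/data_breakpoint.c, we added a write watchpoint on mm->pgd of the faulting process.
We then reran the burn-in test to reproduce the issue and, sure enough, the pgd was being overwritten. The printed backtrace pinpoints the culprit.
How can a hardware breakpoint be used from user space?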
CPU: ARMv7
The system crashes at a fixed stage of kernel boot, and the crash message is always an imprecise external abort.
Unhandled fault: imprecise external abort (0x1c06) at 0x7cab1234
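Imprecise external aborts are relatively rare. The abort is generally reported late, which means that the backtrace printed at abort time shows nothing wrong however carefully it is checked.
First, look at where the kernel prints imprecise external abort.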
arch/arm/mm/fsr-2level.c
arch/arm/mm/fault.c
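The second argument of do_DataAbort, fsr, is very informative: it is the fault status register. The first argument, addr, is the fault address register.
The exact meaning of these two registers can be found in the ARM TRM.
In this case the fault address register does not necessarily hold the faulting address. To see what fsr 0x1c06 means, compare it against the register description.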
Table 4-226 DFSR bit assignments for Short-descriptor translation table format
Bits | Name | Function |
---|---|---|
[31:14] | - | Reserved, RES0. |
[13] | CM | Cache maintenance fault. For synchronous faults, this bit indicates whether a cache maintenance operation generated the fault: 0 Abort not caused by a cache maintenance operation. 1 Abort caused by a cache maintenance operation. |
[12] | ExT | External abort type. This field indicates whether an AXI Decode or Slave error caused an abort: 0 External abort marked as DECERR. 1 External abort marked as SLVERR. For aborts other than external aborts this bit always returns 0. |
[11] | WnR | Write not Read bit. This field indicates whether the abort was caused by a write or a read access: 0 Abort caused by a read access. 1 Abort caused by a write access. For faults on CP15 cache maintenance operations, including the VA to PA translation operations, this bit always returns a value of 1. |
[10] | FS[4] | Part of the Fault Status field. See bits [3:0] in this table. |
[9] | - | RAZ. |
[8] | - | Reserved, RES0. |
[7:4] | Domain | Specifies which of the 16 domains, D15-D0, was being accessed when a data fault occurred. For permission faults that generate Data Abort exception, this field is UNKNOWN. ARMv8 deprecates any use of the domain field in the DFSR. |
[3:0] | FS[3:0] | Fault Status bits. This field indicates the type of exception generated. Any encoding not listed is reserved. |
FS[4:0] (bit [10] together with bits [3:0]):
0b00001 Alignment fault.
0b00010 Debug event.
0b00011 Access flag fault, section.
0b00100 Instruction cache maintenance fault.
0b00101 Translation fault, section.
0b00110 Access flag fault, page.
0b00111 Translation fault, page.
0b01000 Synchronous external abort, non-translation.
0b01001 Domain fault, section.
0b01011 Domain fault, page.
0b01100 Synchronous external abort on translation table walk, first level.
0b01101 Permission fault, section.
0b01110 Synchronous external abort on translation table walk, second level.
0b01111 Permission fault, second level.
0b10000 TLB conflict abort.
0b10101 LDREX or STREX abort.
0b10110 Asynchronous external abort.
0b11000 Asynchronous parity error on memory access.
0b11001 Synchronous parity error on memory access.
0b11100 Synchronous parity error on translation table walk, first level.
0b11110 Synchronous parity error on translation table walk, second level.
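This still does not tell us where the fault happened. An imprecise external abort like this may be a bus error. This IC has a bus monitor, so we checked the register in which the bus records aborts, and it had indeed recorded an abort on a write to a DRAM address.
Further checking showed that the write to this DRAM address actually happened much earlier, in the U-Boot stage. The address written was beyond the DRAM size, which caused the problem. After that was fixed, the imprecise external abort disappeared and the system booted normally.
So why was the abort not raised promptly in the U-Boot stage? Because CPSR.A is masked during U-Boot. If CPSR.A is unmasked in U-Boot and the issue is reproduced, the abort is delivered promptly in U-Boot and enters the abort handling path of the exception vectors.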
[1] https://community.arm.com/thread/5622
[2] http://stackoverflow.com/questions/27507013/synchronous-external-abort-on-arm
An abort means the CPU tried to make a memory access, which for whatever reason, couldn’t be completed so raises an exception.
An external abort is one from, well, externally to the processor, i.e. something on the bus. In other words, the access didn’t fault in the MMU, went out onto the bus, and either some device or the interconnect itself came back and said “hey, I can’t deal with this”.
A synchronous external abort means you’re rather fortunate, in that it’s not going to be utterly hideous to debug - in the case of a prefetch abort, it means the IFAR is going to contain a valid VA for the faulting instruction, so you know exactly what caused it. The unpleasant alternative is an asynchronous external abort, which is little more than an interrupt to say “hey, something you did a while ago didn’t actually work. No I don’t know what is was either.”
[3] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/14809.html
[4] http://lists.infradead.org/pipermail/linux-arm-kernel/2011-November/072495.html
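Then run fluxgui, tick "Autostart f.lux indicator applet", and fill in the latitude and longitude of your location. That completes the setup.
At this point flux turns out not to work yet; flux itself still needs to be installed.
What was installed above is only a GUI; the core flux program has to be installed separately.
64-bit version: xflux64.tgz
32-bit version: xflux32.tgz
Then extract the archive and copy the extracted xflux to /usr/bin/.
Configure fluxgui again and it takes effect.
Windows 10 supports Ubuntu bash starting with Build 14316.
Programs and Features -> Turn Windows features on or off -> select "Windows Subsystem for Linux (Beta)"
Type bash in a command prompt window to download, install, and use bash.
The bash shell runs as a Windows subsystem: WSL (Windows Subsystem for Linux).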
The Performance Of Ubuntu Software Running On Windows 10 With The New Linux Subsystem
Just run vim -u ~/.vimrc.go; if typing that every time is a hassle, set up an alias.
vim-as-golang-ide actually still uses vim-go under the hood. Its advantage is that it leaves the system vim settings untouched.
vim-go: https://github.com/fatih/vim-go
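Running vim -u ~/.vimrc.go produces the following error.
It can be addressed with the following setting in ~/.vimrc.go.
Before running :GoInstallBinaries, temporarily set the $GOBIN environment variable so that the binaries vim-go needs are placed under /usr/local/go/bin.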
Please be sure all necessary binaries are installed (such as gocode, godef, goimports, etc.). You can easily install them with the included :GoInstallBinaries command. If invoked, all necessary binaries will be automatically downloaded and installed to your $GOBIN environment (if not set it will use $GOPATH/bin). Note that this command requires git for fetching the individual Go packages. Additionally, use :GoUpdateBinaries to update the installed binaries.
– https://github.com/fatih/vim-go
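vim-go depends on many other binaries, and in mainland China a proxy is needed to download them.
There is a Go package manager hosted in China, https://gopm.io/, from which the blocked binaries can be downloaded.
For installing Go, see https://golang.org/doc/install
Environment variables: add the following settings to ~/.profile:
GOROOT is the Go installation path.
In package runtime:
/src/runtime/extern.go:213
GOPATH is the Go workspace directory.
In package build:
/src/go/build/build.go:204
GOBIN points to the bin directory under the Go installation directory.
In package main:
/src/cmd/go/build.go:742
go env shows more environment variables.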
GO15VENDOREXPERIMENT
package main
/src/cmd/go/pkg.go:273
hello.go
Refer to “Go Playground” https://play.golang.org/p/F8Ev-6husG
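A minimal hello.go for trying out the go commands could look like the sketch below. The greeting text and the greeting helper are placeholders and are not necessarily what the linked playground snippet contains.

```go
package main

import "fmt"

// greeting returns the text printed by main; the wording is a placeholder.
func greeting() string {
	return "Hello, Go!"
}

func main() {
	fmt.Println(greeting())
}
```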
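go run compiles, links, and runs in one step. After it finishes, neither intermediate files nor the final executable are left in the current directory.
go build compiles and links. After it finishes, the executable hello appears in the current directory. With the -work flag, the temporary work directory is kept.
Go manages projects in an unusual way: there is no project file. Instead, the directory structure and package names are used to infer the project structure and build order, so the directory layout and package names of a Go project matter and must follow the conventions.
Take a four-function calculator (add, subtract, multiply, divide) as an example.
The directory structure is as follows:
A Go project usually contains three directories: bin, pkg, and src. bin and pkg need not be created in advance, because the go command creates them automatically (for example, go install). The src directory, as the name implies, holds the source files. Go source files are organized by package, and creating a new package means creating a new folder under src.
As the tree above shows, src contains two folders, calc and littlemath, that is, two packages. One is the littlemath package: the package name declared in the four files under littlemath should preferably match the directory name. A mismatch is troublesome and confusing; the case where the package name and directory name differ is covered later.
The other folder, calc, is the main package, as declared in calc.go.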
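To make the two packages concrete, here is a condensed single-file sketch. In the real layout the arithmetic functions would live under src/littlemath in files declaring package littlemath, and calc.go would import that package. Apart from Add, which the text mentions later, the function names and signatures are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// In the project layout these four functions would sit in src/littlemath/*.go
// under "package littlemath" and be called as littlemath.Add(...) from calc.go.
// They are folded into package main here only to keep the sketch runnable.

func Add(a, b int) int { return a + b }

func Sub(a, b int) int { return a - b }

func Mul(a, b int) int { return a * b }

// Div returns an error instead of panicking on division by zero.
func Div(a, b int) (int, error) {
	if b == 0 {
		return 0, errors.New("division by zero")
	}
	return a / b, nil
}

func main() {
	fmt.Println("2 + 3 =", Add(2, 3))
	fmt.Println("7 - 4 =", Sub(7, 4))
	fmt.Println("6 * 5 =", Mul(6, 5))
	if q, err := Div(9, 3); err == nil {
		fmt.Println("9 / 3 =", q)
	}
}
```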
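Before building, set GOPATH for this project. Every project needs the GOPATH environment variable set, which is a bit of a hassle. Edit ~/.profile, add the littlecalc path to GOPATH, and then run source ~/.profile.
Building the project with go build
Run the following to see the generated executable calc in the bin directory.
Use the -x flag to see the intermediate build steps.
A temporary directory is created first to hold the intermediate build results. The actual compilation is done by the compile command and the linking by the link command. Finally, the built executable is moved from the temporary directory to the current working directory.
The directory structure after running go build is as follows.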
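Building the project with go install
The result is as follows.
The final output was to be copied to /usr/local/go/bin, but the current user is not root and has no permission to do that. Oddly, why is it not copied to bin under the current workspace? It turns out that the $GOBIN environment variable is the cause. Earlier, $GOBIN was set to $GOROOT/bin, so reset the environment variable as follows.
Then run go install calc again, and the directory tree looks as follows. bin now contains the final executable calc, and pkg contains the package file littlemath.a.
The difference between go build and go install
The -x output above already gives a rough picture of the difference. go install creates bin and pkg, puts the compiled dependency packages in pkg, and puts the final executable in bin. The exact location of this bin is affected by the $GOBIN environment variable.
Now change the directory tree of the small project slightly: rename littlemath to mymath.
The source files in mymath keep the package name littlemath.
The package imported in calc/calc.go also stays littlemath.
Then go install -x calc reports that the littlemath package cannot be found.
But if the import in calc/calc.go is changed to mymath, compilation succeeds.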
go install -x calc
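From the above, the compiled static package file is named after the directory. The import statement takes the directory name, while references to the package in code use the package name.
Although littlemath was renamed to mymath and import "littlemath" in calc/calc.go was changed to import "mymath", the source files under mymath still declare package littlemath, and calc/calc.go still calls the package's functions as littlemath.Add() and the like.
This comes from the article Go项目的目录结构 (The Directory Structure of a Go Project).
Create a build.sh file alongside src, with the following content.
Usage: sh build.sh [packages], for example sh build.sh calc.
calc source code: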
https://github.com/magicse7en/go-practice/commit/668cd75c498bfd1c8542eeefb8547c53dd2e7cde
Source code with mismatched package and directory names:
https://github.com/magicse7en/go-practice/commit/fce999305dd0c025493a4e3282379291c8d8f69e
build.sh
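The idea is that writing a blog post should only require pushing the md file and the resource files it needs. Travis CI, a continuous integration tool, can meet this need.
Log in to GitHub, then Settings -> Personal access tokens -> Generate new token
Fill in the token description, for example hexo deploy.
Tick the scopes to grant (I ticked repo and gist), then create the token.
Copy and save the generated token string. It is needed later, and if it is lost the only option is to generate a new one.
Set Environment Variables: name the variable "DEPLOY_REPO", paste the token copied in the previous step, and turn off the "Display value in build log" option. After it is added, it looks like the figure below:
Check that the token is valid
In the hexo blog repository, create a new branch "raw" to hold the md files, resource files, theme files, and so on.
Create a .travis.yml file and push it to the raw branch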
refer to: https://github.com/magicse7en/magicse7en.github.io/blob/raw/.travis.yml
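Note that branches must be set to only raw:
Open travis-ci.org and you will see the build in progress. Check the build log to see whether the build is OK.
If the build is OK, open the blog home page to check whether the new post was published successfully.
This is a personal token problem: regenerate the token, confirm it is valid with travis whoami, and then configure the Travis CI environment variable.
This problem does not occur when deploying locally but does occur in the Travis VM. The fix is to add the following to .travis.yml
Switching the theme back to the default landscape displays the content normally, which pins the problem on the next theme configuration. Checking showed that the contents of themes/next were ignored and had not been pushed to the raw branch.
There are two solutions:
Use .gitmodules. This imports the next theme repository directly. The advantage is that the latest next theme can be used; the drawback is that the theme configuration file cannot be customized.
Delete .git and .gitignore under themes/next, after which the contents of themes/next can be pushed to the repository.
Add node_modules to the cache in .travis.yml to speed up the build
To show a build status badge in the README.md on GitHub, modify README.md:
A borrowed diagram illustrates how Travis CI builds the hexo blog automatically: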
CPU: ARMv7
Kernel: 3.10.26
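Recently the kernel compression algorithm was changed from gzip to lzo, and the CPU aborts while the kernel is decompressing itself during boot.
First, check that the kernel compression algorithm really is lzo: arch/arm/boot/compressed already contains the piggy.lzo file. The zImage loaded from flash is also correct, and so is the zImage data in DRAM before self-decompression. After the crash, the zImage data in DRAM was checked again and had not changed, so the zImage source data is correct and was not overwritten. The problem can only be in the decompression process itself.
Debugging showed that the failure always happens while decompressing one particular block of data. The C code is in include/linux/unaligned/le_struct.h
and the corresponding faulting assembly instruction is
When the fault occurs, r8 holds an odd value, for example 0x04000003, so an alignment problem was suspected. The kernel config was checked and no alignment-related option was missing. Looking at the DFAR and DFSR registers, the Fault Status bits are 0x1, which according to the ARM manual is an alignment fault.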
FS[3:0] Fault Status bits. This field indicates the type of exception generated. Any encoding not listed is reserved:
0b00001 Alignment fault.
0b00010 Debug event.
0b00011 Access flag fault, section.
0b00100 Instruction cache maintenance fault.
0b00101 Translation fault, section.
0b00110 Access flag fault, page.
0b00111 Translation fault, page.
0b01000 Synchronous external abort, non-translation.
0b01001 Domain fault, section.
0b01011 Domain fault, page.
0b01100 Synchronous external abort on translation table walk, first level.
0b01101 Permission fault, section.
0b01110 Synchronous external abort on translation table walk, second level.
0b01111 Permission fault, second level.
0b10000 TLB conflict abort.
0b10101 LDREX or STREX abort.
0b10110 Asynchronous external abort.
0b11000 Asynchronous parity error on memory access.
0b11001 Synchronous parity error on memory access.
0b11100 Synchronous parity error on translation table walk, first level.
0b11110 Synchronous parity error on translation table walk, second level.
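An odd address such as 0x04000003 cannot be naturally aligned for any access wider than one byte. The check below is a small illustration that assumes a 4-byte access; the access width is an assumption, since the faulting instruction is not shown here. On ARMv7 an unaligned access to Device memory always raises an alignment fault, whereas Normal memory can tolerate it when strict alignment checking is disabled.

```go
package main

import "fmt"

// isAligned reports whether addr is a multiple of size.
// size must be a power of two.
func isAligned(addr, size uint32) bool {
	return addr&(size-1) == 0
}

func main() {
	const r8 = 0x04000003 // example register value from the text
	fmt.Println("4-byte aligned:", isAligned(r8, 4))
	fmt.Println("rounded down  :", isAligned(r8&^3, 4))
}
```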
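Further debugging showed that r8 in the instruction above is often odd during decompression without causing a crash. So why does it fail only when decompressing one particular block? This suggested that the memory involved at the time of the failure has different attributes from the rest. Checking the MMU table confirmed it: the memory accessed by the faulting instruction is mapped with outer, Device attributes, while the other sections are mapped as readable and writable.
Next, check when the MMU table is enabled during kernel self-decompression, and when the mapping is set up.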
arch/arm/boot/compressed/head.S
This assembly code sets the mapping attributes.
The execute address of the zImage set in the uImage was unsuitable, which made the code above set the wrong MMU attributes.
Then configure username and email
If you do not have a GitHub account, register one, then copy the contents of .ssh/id_rsa.pub into GitHub under Settings -> SSH Keys -> New SSH Key
The following prompt appears
The authenticity of host ‘github.com (207.97.227.239)’ can’t be established. RSA key fingerprint is 16:27:ac:a5:76:28:2d:36:63:1b:56:4d:eb:df:a6:48. Are you sure you want to continue connecting (yes/no)?
Type yes, and the following message appears:
Hi xxx! You’ve successfully authenticated, but GitHub does not provide shell access.
The name must be GithubId.github.io
Download the tarball from the Node.js website -> extract it -> create a symlink
docs: https://hexo.io/docs/
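Configuration file
In _config.yml, change type as follows:
Then deploy and browse to https://githubid.github.io ; the blog page appears.
Create a new post:
Then edit source/_posts/hell-hexo.md using Markdown syntax
When editing is done, generate the site with hexo generate, then start the server with hexo server to preview the pages locally.
Once satisfied, deploy directly to GitHub: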
hexo deploy
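It is best to clean before each deploy
Generation and deployment can be done in one step with
After much indecision I finally chose the Next theme; configuration guide: [http://theme-next.iissnan.com]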
More info: Writing
More info: Server
More info: Generating
More info: Deployment