1、使用GatewayWorker版本是v3.0.13,PHP7.4;
2、目前有7臺(tái)Gateway服務(wù)器,每臺(tái)規(guī)格是32C32G,每臺(tái)Gateway是96進(jìn)程數(shù),7臺(tái)Worker服務(wù)器,規(guī)格是16C16G,Worker是32進(jìn)程數(shù);
3、高峰期時(shí),所有Worker進(jìn)程都出現(xiàn)busy,那會(huì)沒有把Log截取下來,后面就直接升級(jí)機(jī)器配置;
4、升級(jí)后,status -d,偶爾會(huì)出現(xiàn)busy,大概4秒后就消失,strace -p 結(jié)果如下,這類busy是因?yàn)橛薪K端開了 status -d導(dǎo)致嗎?
14:50:34.311724 close(1022) = 0 <0.000054>
14:50:34.311908 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <0.167065>
14:50:34.479180 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
14:50:34.479303 write(6, "\f", 1) = 1 <0.000042>
14:50:34.479420 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000053>
14:50:34.479629 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <1.000244>
14:50:35.480005 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
14:50:35.480083 write(6, "\f", 1) = 1 <0.000035>
14:50:35.480205 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000017>
14:50:35.480328 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <1.000339>
14:50:36.480777 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
14:50:36.480845 write(6, "\f", 1) = 1 <0.000022>
14:50:36.480938 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000011>
14:50:36.481058 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <1.000419>
14:50:37.481553 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
14:50:37.481605 write(6, "\f", 1) = 1 <0.000006>
14:50:37.481647 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000004>
14:50:37.481753 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <1.000506>
14:50:38.482340 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
14:50:38.482411 write(6, "\f", 1) = 1 <0.000010>
14:50:38.482464 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000018>
14:50:38.482618 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <1.000276>
14:50:39.482990 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
14:50:39.483063 write(6, "\f", 1) = 1 <0.000011>
14:50:39.483136 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000005>
14:50:39.484465 munmap(0x7f5961600000, 2097152) = 0 <0.000033>
14:50:39.484553 munmap(0x7f5961400000, 2097152) = 0 <0.000014>
14:50:39.484586 munmap(0x7f5961200000, 2097152) = 0 <0.000013>
14:50:39.484627 munmap(0x7f5961000000, 2097152) = 0 <0.000044>
14:50:39.484740 munmap(0x7f5960e00000, 2097152) = 0 <0.000034>
14:50:39.484835 munmap(0x7f5960c00000, 2097152) = 0 <0.000057>
14:50:39.484986 munmap(0x7f5960a00000, 2097152) = 0 <0.000033>
14:50:39.485071 munmap(0x7f5960800000, 2097152) = 0 <0.000037>
14:50:39.485171 munmap(0x7f5960600000, 2097152) = 0 <0.000036>
14:50:39.485266 munmap(0x7f5960400000, 2097152) = 0 <0.000024>
14:50:39.485372 munmap(0x7f5960200000, 2097152) = 0 <0.000021>
14:50:39.485423 close(1023) = 0 <0.000040>
14:50:39.485520 close(1024) = 0 <0.000021>
14:50:39.513202 close(1352) = 0 <0.000032>
14:50:39.513270 close(1353) = 0 <0.000035>
14:50:39.513525 sendto(8, "G\0\0\0\26UPDATE `user` SET `logout_t"..., 75, MSG_DONTWAIT, NULL, 0) = 75 <0.000026>
14:50:39.513617 poll([{fd=8, events=POLLIN|POLLERR|POLLHUP}], 1, 86400000) = 1 ([{fd=8, revents=POLLIN}]) <0.000164>
14:50:39.513840 recvfrom(8, "\f\0\0\1\0l\1\0\0\0\0\2\0\0\0\0\27\0\0\2\3def\0\0\0\1?\0\f?"..., 32768, MSG_DONTWAIT, NULL, NULL) = 79 <0.000043>
14:50:39.513994 sendto(8, "\5\0\0\0\31k\1\0\0", 9, MSG_DONTWAIT, NULL, 0) = 9 <0.000019>
14:50:39.514105 sendto(8, "#\0\0\0\27l\1\0\0\0\1\0\0\0\0\1\375\0\375\0\n1746255039\7"..., 39, MSG_DONTWAIT, NULL, 0) = 39 <0.000048>
14:50:39.514248 poll([{fd=8, events=POLLIN|POLLERR|POLLHUP}], 1, 86400000) = 1 ([{fd=8, revents=POLLIN}]) <0.000888>
14:50:39.515209 recvfrom(8, "\7\0\0\1\0\1\0\2\0\0\0", 32768, MSG_DONTWAIT, NULL, NULL) = 11 <0.000023>
14:50:39.515335 alarm(0) = 24 <0.000016>
14:50:39.515413 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}, {events=EPOLLIN, data={u32=231, u64=231}}, {events=EPOLLIN, data={u32=233, u64=233}}, {events=EPOLLIN, data={u32=332, u64=332}}, {events=EPOLLIN, data={u32=330, u64=330}}, {events=EPOLLIN, data={u32=331, u64=331}}, {events=EPOLLIN, data={u32=333, u64=333}}, {events=EPOLLIN, data={u32=334, u64=334}}, {events=EPOLLIN, data={u32=335, u64=335}}, {events=EPOLLIN, data={u32=337, u64=337}}, {events=EPOLLIN, data={u32=336, u64=336}}, {events=EPOLLIN, data={u32=338, u64=338}}, {events=EPOLLIN, data={u32=339, u64=339}}, {events=EPOLLIN, data={u32=340, u64=340}}, {events=EPOLLIN, data={u32=341, u64=341}}, {events=EPOLLIN, data={u32=342, u64=342}}, {events=EPOLLIN, data={u32=345, u64=345}}, {events=EPOLLIN, data={u32=343, u64=343}}, {events=EPOLLIN, data={u32=344, u64=344}}, {events=EPOLLIN, data={u32=346, u64=346}}, {events=EPOLLIN, data={u32=348, u64=348}}, {events=EPOLLIN, data={u32=347, u64=347}}, {events=EPOLLIN, data={u32=350, u64=350}}, {events=EPOLLIN, data={u32=349, u64=349}}, {events=EPOLLIN, data={u32=351, u64=351}}, {events=EPOLLIN, data={u32=352, u64=352}}, {events=EPOLLIN, data={u32=353, u64=353}}, {events=EPOLLIN, data={u32=355, u64=355}}, {events=EPOLLIN, data={u32=354, u64=354}}, {events=EPOLLIN, data={u32=356, u64=356}}, {events=EPOLLIN, data={u32=358, u64=358}}, {events=EPOLLIN, data={u32=357, u64=357}}, ...], 1024, 626) = 196 <0.000128>
14:50:39.515663 read(5, "\f\f\f\f\f\f", 1024) = 6 <0.000022>
14:50:39.515733 read(5, 0x7f597c37a340, 1024) = -1 EAGAIN (Resource temporarily unavailable) <0.000030>
14:50:39.515830 recvfrom(231, "\0\0\0L\3\n\7\17\340u\311\n\7\t\305\213J\0\0\1\206\1\313\210\0\0\0\32a:1:"..., 65535, 0, NULL, NULL) = 152 <0.000018>
14:50:39.515950 alarm(30) = 0 <0.000015>
14:50:39.516009 write(1, "[2025-05-03 14:50:39]{\"event\":\"w"..., 44) = 44 <0.000018>
后面停用status -d ,使用 strace -p -o > xx.log 的方式觀察,記錄停頓時(shí)間,同時(shí)查看 status,發(fā)現(xiàn)是busy,幾次基本都是這個(gè)片段select卡住,請(qǐng)問這timeout是什么情況
15:30:15.776606 close(1022) = 0 <0.000028>
15:30:15.776759 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler) <1.630117>
15:30:17.407135 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=3285, si_uid=1000} ---
15:30:17.407220 write(6, "\f", 1) = 1 <0.000020>
15:30:17.407287 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000012>
15:30:17.407410 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = 0 (Timeout) <5.005080>
15:30:22.413499 close(1024) = 0 <0.000059>
15:30:22.413645 close(1025) = 0 <0.000044>
觀察后,補(bǔ)充個(gè)現(xiàn)象:
目前是7臺(tái)Gateway,每臺(tái)96進(jìn)程,每臺(tái)Worker服務(wù)的單進(jìn)程需要打開672連接,連上每臺(tái)gateway的端口;
觀察期間的strace(截取片段),發(fā)現(xiàn)有周期性的再次打開新的672個(gè)連接(對(duì)應(yīng)strace日志的端口數(shù)量1353,數(shù)量接近),而后又close掉這些剛剛打開的新鏈接,這期間就會(huì)產(chǎn)生busy;
17:07:14.133450 fcntl(1352, F_GETFL) = 0x2 (flags O_RDWR) <0.000013>
17:07:14.133493 fcntl(1352, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000023>
17:07:14.133581 connect(1352, {sa_family=AF_INET, sin_port=htons(30139), sin_addr=inet_addr("10.7.113.155")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000024>
17:07:14.133655 poll([{fd=1352, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 3000) = 1 ([{fd=1352, revents=POLLOUT}]) <0.000023>
17:07:14.133742 getsockopt(1352, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000024>
17:07:14.133813 fcntl(1352, F_SETFL, O_RDWR) = 0 <0.000021>
17:07:14.133898 sendto(1352, "\0\0\0#\17\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0077979"..., 35, 0, NULL, 0) = 35 <0.000071>
17:07:14.134023 socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 1353 <0.000030>
17:07:14.134090 fcntl(1353, F_GETFL) = 0x2 (flags O_RDWR) <0.000020>
17:07:14.134155 fcntl(1353, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000017>
17:07:14.134202 connect(1353, {sa_family=AF_INET, sin_port=htons(30155), sin_addr=inet_addr("10.7.113.155")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000042>
17:07:14.134293 poll([{fd=1353, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 3000) = 1 ([{fd=1353, revents=POLLOUT}]) <0.000027>
17:07:14.134365 getsockopt(1353, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000014>
17:07:14.134443 fcntl(1353, F_SETFL, O_RDWR) = 0 <0.000019>
跳過..
17:07:19.695366 close(1347) = 0 <0.000030>
17:07:19.695442 close(1348) = 0 <0.000021>
17:07:19.695489 close(1349) = 0 <0.000032>
17:07:19.695549 close(1350) = 0 <0.000031>
17:07:19.695611 close(1351) = 0 <0.000025>
17:07:19.695658 close(1352) = 0 <0.000022>
17:07:19.695717 close(1353) = 0 <0.000042>
問題大致可以定位到這,正常是681u,不清楚是內(nèi)部機(jī)制還是什么問題,導(dǎo)致進(jìn)程打開新的鏈接(超過定義的數(shù)量)
你們整個(gè)集群gateway進(jìn)程太多了,導(dǎo)致集群內(nèi)部出現(xiàn)大量?jī)?nèi)部通訊,影響了性能。
安裝event擴(kuò)展并優(yōu)化linux內(nèi)核后,每個(gè)gateway進(jìn)程能輕松處理支持1-3萬的連接。所以要預(yù)估好整個(gè)系統(tǒng)支持的并發(fā)量,設(shè)置合理的gateway進(jìn)程數(shù)。比如10萬連接,整個(gè)集群(注意是整個(gè)集群)的geteway進(jìn)程設(shè)置為4-6個(gè)gateway進(jìn)程就好了,多了反而影響性能。
我們有一個(gè)物聯(lián)網(wǎng)系統(tǒng),大概用了3臺(tái)4核4G服務(wù)器支撐了10萬的在線,整個(gè)系統(tǒng)負(fù)載還極低。
所以你們系統(tǒng)用7臺(tái)32核跑Gateway進(jìn)程太浪費(fèi)了,每臺(tái)96 Gateway進(jìn)程太離譜了。
你們可以通過以下策略設(shè)置服務(wù)器數(shù)Gateway進(jìn)程數(shù),假設(shè)前期預(yù)期最高4萬在線,就使用一臺(tái)2核2G服務(wù)器設(shè)置2個(gè)Gateway進(jìn)程數(shù),以后每增加4萬在線就增加一臺(tái)2核2G 2gateway進(jìn)程的gateway服務(wù)器。這樣可以做到隨時(shí)擴(kuò)容,性能又好??傮w原則,gateway進(jìn)程夠用就行,不要開多了。
Worker服務(wù)器和進(jìn)程數(shù)看業(yè)務(wù)情況,一般配置成CPU*4
即可,例如16核配置成64個(gè)進(jìn)程。
追蹤Worker的busy進(jìn)程經(jīng)常會(huì)發(fā)生:
10:24:35.955020 select(1024, [], [], [], {tv_sec=5, tv_usec=0}) = 0 (Timeout)
也是因?yàn)間ateway太多導(dǎo)致么?
另外你的php版本也有問題,7.4.1,小版本號(hào)是1,這說明是php7.4的第一個(gè)版本。不要用小版本號(hào)太小的PHP版本,小版本號(hào)越低,PHP自身的bug越多。用成熟的版本,例如7.4.33
老大,一個(gè)gateway進(jìn)程維護(hù)2萬個(gè)鏈接,如果是消息交互比較頻繁的應(yīng)用的話,這種是直接升級(jí)服務(wù)器配置,按一核一個(gè)gateway,還是直接增加一臺(tái)2核2G 2gateway進(jìn)程的gateway服務(wù)器呢?
理論上都可以。
如果升級(jí)服務(wù)器配置需要重啟gateway,那么數(shù)萬個(gè)客戶端一起重連可能會(huì)導(dǎo)致系統(tǒng)過載,所以動(dòng)態(tài)加服務(wù)器好一些。