[Bug]: Querying historical SQL on Alibaba Cloud intermittently fails with Error 20405 (HY000): file local-test20/data/eb575fd2-252f-11ee-91e4-1a5f33c77f72_00045 is not found #10763

Closed
1 task done
xiaoshuwei opened this issue Jul 19, 2023 · 18 comments
Labels: kind/bug (Something isn't working), severity/s1 (High impact: Logical errors or data errors that must occur)

xiaoshuwei commented Jul 19, 2023

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93):ac56ec384ac6093e19be34d542f5ed20b76ebc1b
- Hardware parameters:
- OS type:
- Others: aliyun  ack multi-cn

Actual Behavior

/* cloud_nonuser */
SELECT * FROM (
select statement,system.statement_info.statement_id,
IF(status='Running', TIMESTAMPDIFF(MICROSECOND,request_at,now())*1000, duration) AS duration,
status,query_type,request_at,system.statement_info.response_at,user,database,transaction_id,session_id,rows_read,bytes_scan,result_count
from system.statement_info left join mo_catalog.statement_cu ON system.statement_info.statement_id = mo_catalog.statement_cu.statement_id where 1=1
AND request_at >= '2023-07-19 05:09:59' AND system.statement_info.account = 'd4d103c4_4088_4674_9591_5dddb48de490' AND sql_source_type IN ('cloud_user_sql','external_sql')
)t ORDER BY request_at DESC LIMIT 20;
Error reported: Error 20405 (HY000): file local-test20/data/eb575fd2-252f-11ee-91e4-1a5f33c77f72_00045 is not found

Expected Behavior

The query returns results.

Steps to Reproduce

As Actual Behavior

Additional information

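The problem is intermittent: sometimes it occurs frequently, and sometimes everything works normally.
So far it has been observed on Alibaba Cloud.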

@xiaoshuwei xiaoshuwei added kind/bug Something isn't working needs-triage labels Jul 19, 2023
@xiaoshuwei xiaoshuwei added the severity/s0 Extreme impact: Cause the application to break down and seriously affect the use label Jul 19, 2023
@w-zr w-zr assigned reusee and unassigned matrix-meow Jul 19, 2023
Contributor

reusee commented Jul 19, 2023

I don't think this is a file service problem.
If it were, all file I/O would be affected, not just this scenario.

re-assigning to @xzxiong

@reusee reusee assigned xzxiong and unassigned reusee Jul 19, 2023
@xzxiong xzxiong assigned matrix-meow and unassigned xzxiong Jul 24, 2023
Contributor

xzxiong commented Jul 24, 2023

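> I don't think this is a file service problem. If it were, all file I/O would be affected, not just this scenario.
>
> re-assigning to @xzxiong

This is not related to my area; please ask the engine team to analyze it.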

@xzxiong xzxiong assigned xzxiong and unassigned matrix-meow Jul 24, 2023
Contributor

xzxiong commented Jul 25, 2023

The original error stack has been lost; trying to reproduce the problem.

Contributor

xzxiong commented Jul 28, 2023

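Checked on the current aliyun MO cluster (deployed at 2023-07-19 11:21:34 +0000):

  1. There are many "not found" errors. Their file path prefixes fall into three groups: cnservice, dnservice, and local-test11/data/query_result_meta.
  2. The error reported in this issue has not been reproduced yet.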
mysql> select error from system.error_info where error like '%not found' limit 10;
+-------------------------------------------------------------------------------------------------------------------------------------+
| error                                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------+
| file local-test11/data/query_result_meta/2c9c93aa_32b5_437b_aba2_63be6a1a90d9_5b372fc5-27bd-11ee-aa0b-968bc25d7ec7.blk is not found |
| result file query_result_meta/2c9c93aa_32b5_437b_aba2_63be6a1a90d9_5b372fc5-27bd-11ee-aa0b-968bc25d7ec7.blk not found               |
| file local-test11/data/query_result_meta/d6a25882_9272_4101_b0d9_03242790112b_7779971e-26e4-11ee-a2ba-b2bc5eadcb62.blk is not found |
| result file query_result_meta/d6a25882_9272_4101_b0d9_03242790112b_7779971e-26e4-11ee-a2ba-b2bc5eadcb62.blk not found               |
| result file query_result_meta/d6a25882_9272_4101_b0d9_03242790112b_7779971e-26e4-11ee-a2ba-b2bc5eadcb62.blk not found               |
| file local-test11/data/query_result_meta/d6a25882_9272_4101_b0d9_03242790112b_7779971e-26e4-11ee-a2ba-b2bc5eadcb62.blk is not found |
| file local-test11/data/query_result_meta/d6a25882_9272_4101_b0d9_03242790112b_79801b31-26e4-11ee-a2ba-b2bc5eadcb62.blk is not found |
| result file query_result_meta/d6a25882_9272_4101_b0d9_03242790112b_79801b31-26e4-11ee-a2ba-b2bc5eadcb62.blk not found               |
| file local-test11/data/query_result_meta/d6a25882_9272_4101_b0d9_03242790112b_79801b31-26e4-11ee-a2ba-b2bc5eadcb62.blk is not found |
| result file query_result_meta/d6a25882_9272_4101_b0d9_03242790112b_79801b31-26e4-11ee-a2ba-b2bc5eadcb62.blk not found               |
+-------------------------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.29 sec)

mysql> select left(error, 20) from system.error_info where error like '%not found' group by left(error, 20) limit 20;
+----------------------+
| left(error, 20)      |
+----------------------+
| file local-test11/da |
| result file query_re |
| file cnservice/33396 |
| file dnservice/33376 |
| file cnservice/33663 |
| file cnservice/37353 |
| not found            |
+----------------------+
7 rows in set (0.39 sec)
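
The prefix grouping done by the second query can also be repeated offline on exported error messages. The following is only a minimal sketch in Python, assuming the messages are available as plain strings; the function name `classify_not_found` and the category labels are hypothetical and are not part of MatrixOne. The sample inputs are the truncated `left(error, 20)` values shown above.

```python
from collections import Counter


def classify_not_found(message: str) -> str:
    """Return a coarse category for a 'not found' error message.

    Hypothetical helper for offline triage; not a MatrixOne API.
    """
    text = message.strip()
    if text.startswith("result file "):
        return "result file"
    if text.startswith("file "):
        path = text[len("file "):]
        # Use the first path component, e.g. "cnservice" or "local-test11".
        return path.split("/", 1)[0]
    return "other"


# Truncated prefixes copied from the query output above.
samples = [
    "file local-test11/da",
    "result file query_re",
    "file cnservice/33396",
    "file dnservice/33376",
    "file cnservice/33663",
    "file cnservice/37353",
    "not found",
]

counts = Counter(classify_not_found(s) for s in samples)
for category, n in sorted(counts.items()):
    print(category, n)
```

Grouping on the first path component rather than on a fixed 20-character prefix merges the separate cnservice rows into one bucket, which makes the three groups mentioned above easier to see.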

Contributor

xzxiong commented Jul 29, 2023

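Tried to access the data of the old MO cluster (bucket ack-test-bucket/local-test20),
but could not log in because I do not have the password for the dump account.

Restarted the old MO Cloud aliyun deployment with the following configuration:
mo.ack.yaml.txt

Related scripts:

# Start MO
kubectl -n xiezexiong apply -f mo.ack.yaml
# Delete MO
kubectl -n xiezexiong delete -f mo.ack.yaml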

Contributor

xzxiong commented Jul 31, 2023

Used a hacked dev build to bypass authentication for the dump account.

diff --git a/pkg/frontend/mysql_protocol.go b/pkg/frontend/mysql_protocol.go
index 7e8e05480..ab39da3f5 100644
--- a/pkg/frontend/mysql_protocol.go
+++ b/pkg/frontend/mysql_protocol.go
@@ -1131,7 +1131,7 @@ func (mp *MysqlProtocolImpl) authenticateUser(ctx context.Context, authResponse
                logDebugf(mp.getDebugStringUnsafe(), "authenticate user 2")

                //TO Check password
-               if mp.checkPassword(psw, mp.GetSalt(), authResponse) {
+               if mp.checkPassword(psw, mp.GetSalt(), authResponse) || mp.GetUserName() == "dump" {
                        logDebugf(mp.getDebugStringUnsafe(), "check password succeeded")
                        ses.InitGlobalSystemVariables()
                } else {

Contributor

xzxiong commented Jul 31, 2023

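Queries on the restarted MO are very slow; the cause is unknown.
Judging from OSS, every file is small: about 20 KB on average (43,691,515,035 bytes across 2,108,280 objects, per the du output below).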

# ossutil du oss://ack-test-bucket/local-test20/data
storage class   object count            sum size(byte)
----------------------------------------------------------
Standard      	2108280             	43691515035
----------------------------------------------------------
total object count: 2108280             	total object sum size: 43691515035
total part count:   0                           total part sum size:   0

total du size(byte):43691515035

572.980713(s) elapsed
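
As a quick check of the figures reported by `ossutil du`, the average object size and the listing rate can be derived directly from the totals. This is only a small arithmetic sketch; the variable names are illustrative.

```python
# Totals copied from the ossutil du output above.
object_count = 2_108_280
total_bytes = 43_691_515_035
elapsed_seconds = 572.980713

avg_bytes = total_bytes / object_count
objects_per_second = object_count / elapsed_seconds

print(f"average object size: {avg_bytes:.0f} bytes ({avg_bytes / 1024:.1f} KiB)")
print(f"listing rate: {objects_per_second:.0f} objects/s")
```

The average comes out at roughly 20 KB per object, which is consistent with a bucket dominated by many small files.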

Contributor

xzxiong commented Aug 1, 2023

Provisioned a VM on aliyun, used ossutil to download the whole data directory (excluding every .blk file under query_result; these files are numerous and small, about 500 B each), and then started MO in standalone mode.

  1. Download the OSS files
    ossutil64 sync oss://ack-test-bucket/local-test20/data ./local-test20 --backup-dir backup --exclude "*.blk"
  2. Modify the CN/DN/LogService configuration files so that MO can read the data
 [[fileservice]]
 name = "SHARED"
-backend = "DISK"
+backend = "DISK-ETL"
 data-dir = "mo-data/s3"
  3. Create a symlink
cd .../matrixone
mkdir mo-data
ln -s ....../local-test20   ./mo-data/s3
  4. Start MO normally
./mo-service -debug-http :6060  -launch ./etc/launch-tae-CN-tae-DN/launch.toml

Contributor

xzxiong commented Aug 1, 2023

Captured the corresponding error_info records:

mysql> select * from error_info where timestamp between '2023-07-19 07:09:00'  and '2023-07-19 07:11:00' and error like '%not found'\G

*************************** 1. row ***************************
timestamp: 2023-07-19 07:10:54.429377
 err_code: 20405
    error: file local-test20/data/eb575fd2-252f-11ee-91e4-1a5f33c77f72_00045 is not found
 trace_id: 0
  span_id: 0
span_kind: internal
node_uuid: 33663435-6436-3762-3064-613763616537
node_type: CN
    stack: file local-test20/data/eb575fd2-252f-11ee-91e4-1a5f33c77f72_00045 is not found
(1)
Wraps: (2)
  -- stack trace:
  | github.com/matrixorigin/matrixone/pkg/fileservice.(*S3FS).mapError
  |     /go/src/github.com/matrixorigin/matrixone/pkg/fileservice/s3_fs.go:822
  | github.com/matrixorigin/matrixone/pkg/fileservice.(*S3FS).read.func1
  |     /go/src/github.com/matrixorigin/matrixone/pkg/fileservice/s3_fs.go:543
  | github.com/matrixorigin/matrixone/pkg/fileservice.(*S3FS).read.func2
  |     /go/src/github.com/matrixorigin/matrixone/pkg/fileservice/s3_fs.go:569
  | github.com/matrixorigin/matrixone/pkg/fileservice.(*S3FS).read.func3
  |     /go/src/github.com/matrixorigin/matrixone/pkg/fileservice/s3_fs.go:609
  | github.com/matrixorigin/matrixone/pkg/fileservice.(*S3FS).read
  |     /go/src/github.com/matrixorigin/matrixone/pkg/fileservice/s3_fs.go:681
  | github.com/matrixorigin/matrixone/pkg/fileservice.(*S3FS).Read
  |     /go/src/github.com/matrixorigin/matrixone/pkg/fileservice/s3_fs.go:449
  | github.com/matrixorigin/matrixone/pkg/objectio.ReadExtent
  |     /go/src/github.com/matrixorigin/matrixone/pkg/objectio/funcs.go:48
  | github.com/matrixorigin/matrixone/pkg/objectio.ReadObjectMeta
  |     /go/src/github.com/matrixorigin/matrixone/pkg/objectio/funcs.go:102
  | github.com/matrixorigin/matrixone/pkg/objectio.LoadObjectMetaByExtent
  |     /go/src/github.com/matrixorigin/matrixone/pkg/objectio/cache.go:53
  | github.com/matrixorigin/matrixone/pkg/objectio.FastLoadObjectMeta
  |     /go/src/github.com/matrixorigin/matrixone/pkg/objectio/cache.go:63
  | github.com/matrixorigin/matrixone/pkg/vm/engine/disttae.getInfoFromZoneMap
  |     /go/src/github.com/matrixorigin/matrixone/pkg/vm/engine/disttae/stats.go:103
  | github.com/matrixorigin/matrixone/pkg/vm/engine/disttae.CalcStats
  |     /go/src/github.com/matrixorigin/matrixone/pkg/vm/engine/disttae/stats.go:171
  | github.com/matrixorigin/matrixone/pkg/vm/engine/disttae.(*txnTable).Stats
  |     /go/src/github.com/matrixorigin/matrixone/pkg/vm/engine/disttae/txn_table.go:88
  | github.com/matr

*************************** 5. row ***************************
timestamp: 2023-07-19 07:10:54.591385
 err_code: 20405
    error: file local-test20/data/eb575fd2-252f-11ee-91e4-1a5f33c77f72_00033 is not found
 trace_id: 0
  span_id: 0
span_kind: internal
node_uuid: 33663435-6436-3762-3064-613763616537
node_type: CN
    stack: file local-test20/data/eb575fd2-252f-11ee-91e4-1a5f33c77f72_00033 is not found
(1)
Wraps: (2)
  -- stack trace:
  | github.com/matrixorigin/matrixone/pkg/fileservice.(*S3FS).mapError
  |     /go/src/github.com/matrixorigin/matrixone/pkg/fileservice/s3_fs.go:822
  | github.com/matrixorigin/matrixone/pkg/fileservice.(*S3FS).read.func1
  |     /go/src/github.com/matrixorigin/matrixone/pkg/fileservice/s3_fs.go:543
  | github.com/matrixorigin/matrixone/pkg/fileservice.(*S3FS).read.func2
  |     /go/src/github.com/matrixorigin/matrixone/pkg/fileservice/s3_fs.go:569
  | github.com/matrixorigin/matrixone/pkg/fileservice.(*S3FS).read.func3
  |     /go/src/github.com/matrixorigin/matrixone/pkg/fileservice/s3_fs.go:609
  | github.com/matrixorigin/matrixone/pkg/fileservice.(*S3FS).read
  |     /go/src/github.com/matrixorigin/matrixone/pkg/fileservice/s3_fs.go:681
  | github.com/matrixorigin/matrixone/pkg/fileservice.(*S3FS).Read
  |     /go/src/github.com/matrixorigin/matrixone/pkg/fileservice/s3_fs.go:449
  | github.com/matrixorigin/matrixone/pkg/objectio.ReadMultiBlocksWithMeta
  |     /go/src/github.com/matrixorigin/matrixone/pkg/objectio/funcs.go:256
  | github.com/matrixorigin/matrixone/pkg/objectio.(*objectReaderV1).ReadMultiBlocks
  |     /go/src/github.com/matrixorigin/matrixone/pkg/objectio/reader.go:247
  | github.com/matrixorigin/matrixone/pkg/vm/engine/tae/blockio.prefetchJob.func1
  |     /go/src/github.com/matrixorigin/matrixone/pkg/vm/engine/tae/blockio/pipeline.go:152
  | github.com/matrixorigin/matrixone/pkg/vm/engine/tae/tasks.(*Job).Run
  |     /go/src/github.com/matrixorigin/matrixone/pkg/vm/engine/tae/tasks/job.go:122
  | github.com/panjf2000/ants/v2.(*goWorker).run.func1
  |     /go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67
  | runtime.goexit
  |     /usr/local/go/src/runtime/asm_amd64.s:1598
Wraps: (3) file local-test20/data/eb575fd2-252f-11ee-91e4-1a5f33c77f72_00033 is not found
Error types: (1) *errutil.withContext (2) *errutil.withStack (3) *moerr.Error
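
Both rows above report missing files that share the same UUID-like name and differ only in the numeric suffix (`_00045` and `_00033`). To search the logs for every related entry, it helps to pull those parts out of the error text. The sketch below is a hypothetical helper, assuming the message always has the form `file <dir>/<uuid>_<digits> is not found`; `MissingObject` and `parse_missing_object` are illustrative names, not MatrixOne APIs, and no meaning is assumed for the suffix.

```python
import re
from typing import NamedTuple, Optional


class MissingObject(NamedTuple):
    directory: str   # e.g. "local-test20/data"
    object_id: str   # UUID-like part of the file name
    suffix: str      # trailing digits after the underscore


# Assumed message shape: "file <dir>/<uuid>_<digits> is not found".
_PATTERN = re.compile(
    r"file (?P<dir>\S+)/"
    r"(?P<id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
    r"_(?P<suffix>\d+) is not found"
)


def parse_missing_object(message: str) -> Optional[MissingObject]:
    """Extract directory, object id, and suffix, or None if the shape differs."""
    m = _PATTERN.search(message)
    if m is None:
        return None
    return MissingObject(m.group("dir"), m.group("id"), m.group("suffix"))


first = parse_missing_object(
    "file local-test20/data/eb575fd2-252f-11ee-91e4-1a5f33c77f72_00045 is not found"
)
fifth = parse_missing_object(
    "file local-test20/data/eb575fd2-252f-11ee-91e4-1a5f33c77f72_00033 is not found"
)
print(first)
print(fifth)
```

The extracted `object_id` can then be used as the search term in a `LIKE '%...%'` filter on the log tables.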

Contributor

xzxiong commented Aug 1, 2023

Querying the logs shows that eb575fd2-252f-11ee-91e4-1a5f33c77f72_00033 was deleted at 2023-07-18 17:44:37.393895.

select * from log_info where timestamp between '2023-07-18 14:09:00'  and '2023-07-19 07:11:00' and message like '[DB GC] files to delete%' and message like '%eb575fd2%' \G

select * from log_info where timestamp between '2023-07-18 14:09:00'  and '2023-07-19 07:11:00' and message like '%eb575fd2-252f-11ee-91e4-1a5f33c77f72_00033%' order by timestamp\G

found:

  • 2023-07-18 17:44:37.393895, [DB GC] files to delete
  • 2023-07-19 07:00:25.870525, error: file local-test20/data/eb575fd2-252f-11ee-91e4-1a5f33c77f72_00033 is not found
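
The two timestamps above give the interval between the GC deletion and the failed read. The following is a small sketch of that calculation, assuming both timestamps are in the same time zone; the variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the two log entries above.
deleted_at = datetime.strptime("2023-07-18 17:44:37.393895", FMT)
read_failed_at = datetime.strptime("2023-07-19 07:00:25.870525", FMT)

gap = read_failed_at - deleted_at
print(f"read failed {gap} after the file was listed for deletion")
print(f"that is {gap.total_seconds() / 3600:.2f} hours")
```

The read happened more than 13 hours after the file was listed for deletion, so the reader was still referencing an object that GC had removed long before.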

Contributor

xzxiong commented Aug 1, 2023

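The query in this issue runs over a long-lived connection, but it does not open a long-running transaction.
At the time, MO was a public-cloud deployment on Alibaba Cloud ACK with at least 2 CNs running.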

Contributor

xzxiong commented Aug 2, 2023

The MO data has been packaged and uploaded to oss://ack-test-bucket/local-test20/local-test20.data.tgz in the aliyun Shanghai region.
Download:
ossutil cp oss://ack-test-bucket/local-test20/local-test20.data.tgz ./local-test20.data.tgz
Start locally:
#10763 (comment) bypass the dump account
#10763 (comment) start MO locally

@xzxiong xzxiong assigned LeftHandCold and unassigned xzxiong Aug 2, 2023
Contributor

xzxiong commented Aug 2, 2023

@LeftHandCold ptal

@xzxiong xzxiong added this to the 1.0.0 milestone Aug 2, 2023
@LeftHandCold
Contributor

Still working on the object meta and ckp refactoring; I have not had time to analyze this yet.

@LeftHandCold
Contributor

Have not had time to analyze this yet.

@LiSong0214 LiSong0214 modified the milestones: 1.0.0, 1.1.0 Aug 21, 2023
@LeftHandCold
Contributor

Trying to reproduce; still observing.

@LeftHandCold LeftHandCold added severity/s1 High impact: Logical errors or data errors that must occur and removed severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels Nov 7, 2023
@LeftHandCold
Contributor

This has never been reproduced.

@aressu1985
Contributor

It has not appeared again in stability testing either.


9 participants