Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tech Request]: Agg and Window function framework #9231

Open
fengttt opened this issue May 3, 2023 · 11 comments
Open

[Tech Request]: Agg and Window function framework #9231

fengttt opened this issue May 3, 2023 · 11 comments
Assignees
Labels
kind/tech-request New feature or request priority/p0 Critical feature that should be implemented in this version source/dev issues from devs team/c2
Milestone

Comments

@fengttt
Copy link
Contributor

fengttt commented May 3, 2023

Why do you want to refactor this code?

Aggregate functions and Window functions should also be correctly refactored.   Just like ordianry function has a context and a Fn ptr, Agg function needs the following fn ptrs.

init -- initialized context and initial state value.
add  -- add values to agg function
save -- save current context and state value,
merge -- merge saved state values
finalize -- compute the final agg value

Let's use avg as an example.
init -- state has 2 values, sum = 0 and count = 0
add  -- sum += arg, count += 1
save -- save sum and count some data structure
merge -- sum += other.sum, count += other.count
finalize -- result = sum / count

Agg and Window functions are fundementally different from scalar function so they should just be a different framework.

Describe the solution you'd like

Window function also need its new framework, and not as clear as agg, but here is something to get started.
We assume window does not need save/merge (one window function/frame will execute at one node.)

init -- init context and state
transit -- add a new value
-- note that transit may output some rows, 0 or many
finalize -- output, 0 or many rows.

@iamlinjunhong @gavinyue please take a look at this one.

Describe alternatives you've considered

No response

Additional information

No response

@fengttt fengttt added the priority/p0 Critical feature that should be implemented in this version label May 3, 2023
@fengttt fengttt added this to the 1.0.0 milestone May 3, 2023
@m-schen
Copy link
Contributor

m-schen commented Jul 7, 2023

暂无计划,如果1.0只有1个多月的话,可能没时间做这个。

@m-schen
Copy link
Contributor

m-schen commented Aug 11, 2023

今天开始先做点聚合代码的代码清理工作。

计划先删除以下接口:Dup, GetInputTypes, WildAggRealloc等。不过需要先看看怎么删。

@m-schen
Copy link
Contributor

m-schen commented Aug 15, 2023

image 暂时计划将聚合接口改成图示样子,以及将具体实现里的vs,es等属性替换成用vector维护。 修改完成后,聚合的开发主要需要关注以下四个方法: grow函数:增加group组数 merge函数:合并两个聚合结构 fill函数:根据传入的值更新聚合 eval函数:返回聚合结果

不过当前不做。目前会先删多余代码和改bug。

@m-schen
Copy link
Contributor

m-schen commented Aug 17, 2023

count聚合可以自己单独走一个逻辑,不需要走通用的单列聚合框架。这个后面单独做。不和这次改bug一起弄。
可以提高一些性能和减少函数调用。

@fengttt
Copy link
Contributor Author

fengttt commented Aug 19, 2023

9e013fa
In this PR, the way we copy the string is bad, really bad. The refactor need to take care of fixing this as well.

@m-schen
Copy link
Contributor

m-schen commented Aug 21, 2023

已修复旧框架没有null值处理错误和const vector的问题。
本周会去除聚合框架自己维护自己一套id等问题,改成统一走函数框架。以及恢复batch fill接口来提高fill性能。

后续计划是:1. 去除vs, es, srcs等,改成采用vector维护。 2. 移除wild agg reallocate等诡异接口,减少二次分配。

@m-schen
Copy link
Contributor

m-schen commented Aug 22, 2023

太难搞了 1.0搞不完 循环引用和范型的问题太多了 很难把接口抽象出来。

@m-schen m-schen modified the milestones: 1.0.0, 1.1.0 Aug 22, 2023
@sukki37 sukki37 modified the milestones: 1.0.0, 1.1.0 Aug 25, 2023
@sukki37 sukki37 added the source/dev issues from devs label Aug 27, 2023
@florashi181 florashi181 changed the title [Refactoring]: Agg and Window function framework [Tech Request]: Agg and Window function framework Oct 16, 2023
@LiSong0214 LiSong0214 modified the milestones: 1.1.0, 1.2.0 Oct 16, 2023
@m-schen
Copy link
Contributor

m-schen commented Jan 23, 2024

代码开发分支: https://github.com/m-schen/matrixone/tree/main-agg

This was referenced Mar 27, 2024
@sukki37 sukki37 modified the milestones: 1.2.0, 1.3.0-Backlog May 28, 2024
@sukki37 sukki37 assigned fengttt and unassigned m-schen Jul 1, 2024
@sukki37 sukki37 modified the milestones: 1.3.0-Backlog, 1.3.0 Jul 1, 2024
@m-schen
Copy link
Contributor

m-schen commented Jul 1, 2024

@fengttt
这个是之前agg重构的issue,忘记流转关闭了。
代码在1.2版本已合入并正常跑了很长时间了。应该可以关闭了。

当前实现一个聚合函数需要关心的有以下几个方法。

  1. 每个group独自持有的上下文结构,若不需要可以为空。
  2. 整个聚合函数共同持有的上下文结构,若不需要可以为空。
  3. fill方法:向聚合函数插入一行数据。
  4. fills方法:向聚合函数插入n次同样的数据,主要用于处理const vector的输入.
  5. merge方法:合并两个聚合函数的结构。
  6. fluish方法:聚合所有的数据都add完毕,计算结果。

以 issue 描述的 avg为例,其聚合函数实现如下:

  1. group context: nil
  2. agg context : nil
  3. init : count = 0, sum = 0
  4. fill : count++, sum += value
  5. fills: count += n, sum += n * value
  6. merge: count += count2, sum += sum2
  7. flush: return sum / count

具体代码可以看1.2分支或者main分支的
RegisterAggFromFixedRetFixed 方法,该方法用于创建一个“输入为 定长类型, 输出为 定长类型 的聚合函数”。

@fengttt fengttt modified the milestones: 2.0.0, 2.1.0 Nov 1, 2024
@fengttt
Copy link
Contributor Author

fengttt commented Nov 5, 2024

For spill, we still miss how to spill the context. This might need 2 method, if the context will spill out some fixed length bytes or variable length bytes. Please add this info.

And please create a doc in the docs repo (if you already have one, please link here).

Thank you.

@fengttt fengttt assigned m-schen and unassigned fengttt Nov 5, 2024
@fengttt
Copy link
Contributor Author

fengttt commented Nov 5, 2024

I see there is a marshal/unmarshal method pair, so please update doc.

Not clear if use marshal/unmarshal is efficient -- because after this we will always treat marshaled context as bytes, that is, varlength. Spill/load varlength is inherently slower than fixed length.
But I am not sure how the spill code plan to handle the spill/load. Reuse vector? (why not ...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/tech-request New feature or request priority/p0 Critical feature that should be implemented in this version source/dev issues from devs team/c2
Projects
None yet
Development

No branches or pull requests

7 participants