Application Architecture Check-List #20
Labels: core (The core foundation of the application), D-No issue activity (Stale issue), R-Design (Design for application architecture), R-enhancement (New feature or request)
Comments
KDahir247 added the R-enhancement, R-Design, and core labels on Jun 16, 2021
Please leave a comment if I am missing anything, or if a piece of my architecture vision is flawed.
I will be looking at this checklist after I complete each module in the game engine.
I will create a vision paper for the game engine using this.
Check List
Test your code to prove it is implemented correctly, both now and when future changes are made. Tests also make the code's design visible.
Remove long dependency chains and loop-carried dependency chains with high latency.
Assert up front so the later bounds checks can be elided. vec[0] + vec[1] + vec[2] + vec[3] results in four potential panic points, while asserting the length beforehand results in only one. This applies to &[T] and Vec<T>, since their sizes can't be determined at compile time.
Access the highest array index first so the compiler can assume the lower indices are in bounds.
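A minimal sketch of the two points above, assuming a runtime-sized slice (function names are illustrative):

```rust
// One assert up front: a single potential panic point instead of four,
// and the compiler can elide the bounds checks on the indexing below.
fn sum_first_four(v: &[f32]) -> f32 {
    assert!(v.len() >= 4);
    v[0] + v[1] + v[2] + v[3]
}

// Alternatively, touching the highest index first proves the lower
// indices are in bounds, so only the first access is checked.
fn sum_first_four_rev(v: &[f32]) -> f32 {
    v[3] + v[2] + v[1] + v[0]
}
```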
Keep each system one-way: it reads from one source and produces or updates another. Write your systems with single assignments when possible.
Stateful data, stateless systems.
When refactoring, keep in mind to normalize your structures.
Encapsulation does nothing but hide information, bugs, and the programmer's hidden assumptions.
Avoid generic code except where it is clearly a good choice.
No Abstraction.
When passing multiple references of the same type (e.g. multiple arrays) to a function, copy the needed values into local variables to avoid aliasing.
Try to keep modified values separate from read-only values and write-only values.
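A sketch of the local-copy idea (the function is illustrative; note that Rust's borrow rules already rule out most aliasing that plagues C, so this mainly documents intent):

```rust
// Writes to `out` from read-only `input`, scaled by a shared factor.
fn scale_all(out: &mut [f32], input: &[f32], scale: &f32) {
    // Copy the shared value into a local so the compiler keeps it in a
    // register rather than re-reading it through the reference on every
    // iteration.
    let s = *scale;
    for (o, i) in out.iter_mut().zip(input.iter()) {
        *o = i * s;
    }
}
```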
No singletons.
Make data immutable from the start and only change it to mutable if it must be.
Avoid if conditions, as they cause branch mispredictions and pipeline stalls. Especially avoid them in hot code.
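One branchless pattern, as a sketch (the compiler often produces a conditional move for a plain `if` anyway, so measure before committing to this style):

```rust
// Branchless clamp-to-zero: the comparison result becomes an all-ones
// or all-zeros mask, replacing a conditional branch with bit arithmetic.
fn clamp_min_zero(x: i32) -> i32 {
    let mask = -((x > 0) as i32); // -1 (all ones) if x > 0, else 0
    x & mask
}
```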
Convert as much data as possible into primitive data types.
Use plain old data types.
Keep variables local to the thread to prevent false sharing.
No global variables.
Create local copies of array-indexed values so the compiler doesn't re-fetch and reread the data. Avoid hidden assumptions.
No patterns: focus on the solution to the specific project, not a general solution that covers all cases.
Avoid dynamism.
Avoid complex conditional logic.
Try to keep systems side-effect free.
Rename variables so they indicate what they are.
Try to reduce function calls.
No Recursion.
Move loops inside function calls.
If using a two-or-higher-dimensional array, array[i][j + n] (probably in the same cache line) is preferred over array[i + n][j] (probably out in main memory), since the array is still stored in one-dimensional memory.
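A sketch of cache-friendly row-major traversal (the grid size is arbitrary):

```rust
// `grid` is stored row-major: grid[i][j] and grid[i][j + 1] are adjacent
// in memory, while grid[i + 1][j] is a whole row (16 bytes here) away.
fn row_major_sum(grid: &[[u32; 4]; 4]) -> u32 {
    let mut sum = 0;
    for i in 0..4 {
        for j in 0..4 {
            // The innermost loop varies j: sequential, cache-friendly access.
            sum += grid[i][j];
        }
    }
    sum
}
```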
Convert `..=` (inclusive) ranges in iterators to `..` with `some_value + 1`.
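For example (a trivial sketch):

```rust
// `0..=n` needs extra logic to handle a potentially overflowing upper
// bound; the half-open `0..n + 1` optimizes more cleanly.
fn sum_inclusive(n: u32) -> u32 {
    (0..n + 1).sum() // equivalent to (0..=n).sum()
}
```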
Keep function length below 80 lines; 60 to 65 lines is optimal.
Keep a single source file under 500 lines.
Set a larger Vec capacity than necessary (Vec::with_capacity(target_len + some_len) rather than Vec::with_capacity(target_len)).
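A sketch of the over-allocation idea (the +8 slack and function name are arbitrary illustrations):

```rust
// Reserving a little beyond the known target avoids a reallocation if a
// few extra elements are pushed later.
fn collect_ids(target_len: usize) -> Vec<u32> {
    let mut ids = Vec::with_capacity(target_len + 8);
    for i in 0..target_len as u32 {
        ids.push(i);
    }
    ids
}
```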
Variables should have the smallest possible scope.
Each function should handle one simple task.
Components shouldn't be put on other components, but rather on entities.
Performance matters, so run benchmarks and instrumented tests; disassemble the code to find the bottleneck.
Remember: we want variables as narrowly scoped as possible while still getting the job done, especially in critical innermost loops.
Multithreading
Cost of multithreading
Multiple CPU cores or logical processors usually share the same cache, at least at the last cache level, and possibly at lower levels. The advantage of sharing a cache is that communication between threads becomes faster and threads can share code and read-only data. A disadvantage is that the cache fills up if threads use different memory areas, and there is cache contention if different threads write to the same memory area.
Read-only data can be shared across threads, but modified data should be separated per thread. The easiest way is to declare a local variable in each thread and store a copy of the data there, since each thread has its own stack. This prevents threads from invalidating each other's caches, which causes large delays.
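A minimal sketch of this pattern with scoped threads (requires Rust 1.63+; the split and function name are illustrative):

```rust
fn parallel_sum(data: &[u64]) -> u64 {
    let (a, b) = data.split_at(data.len() / 2);
    std::thread::scope(|s| {
        // Each thread accumulates into a local on its own stack; the
        // shared slice is only read, never written, so neither thread
        // invalidates the other's cache lines.
        let handle = s.spawn(|| a.iter().sum::<u64>());
        let sum_b: u64 = b.iter().sum();
        handle.join().unwrap() + sum_b
    })
}
```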
Have no more threads of the same priority than there are cores or logical processors in the system.
Structures containing thread-specific data (one instance per thread) should be aligned to at least the cache line size (64 bytes) to avoid multiple threads writing to the same cache line. The cache locks the accessed line when one core writes to the corresponding memory, so another thread is held waiting.
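In Rust this can be expressed with an alignment attribute (a sketch; the struct is hypothetical):

```rust
// Padding per-thread state to a full 64-byte cache line so two threads'
// counters never share a line (no false sharing).
#[repr(align(64))]
struct PerThread {
    hits: u64,
}

fn per_thread_align() -> usize {
    std::mem::align_of::<PerThread>()
}

fn per_thread_size() -> usize {
    // Alignment rounds the size up to a whole cache line.
    std::mem::size_of::<PerThread>()
}
```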
Keep communication and synchronization between threads to a minimum, since all these methods are time-consuming.
Simultaneous multithreading (hyper-threading) adds complexity and might be slower than a single thread.
Running two threads on each physical core makes the threads compete for the same resources, such as cache and execution units.
They might run at half speed due to cache eviction, cache misses, branch mispredictions, and other conflicts. Use simultaneous multithreading with great caution; otherwise use a single thread per physical core.
Exception & Testing
Think critically about how errors are handled in your scripts. There are user errors, such as invalid input, and programmer errors, which are bugs in the source code itself.
Programming and logic errors should be handled where they occur and should crash the program rather than be propagated, while user errors should be handled gracefully and propagated only one level (from the function that errors to where the user calls it).
Programming errors and logic errors should be minimized by adding tests.
Floating-point and vector code should rely on propagating NaN and INF. No exceptions.
Performance
Optimizing for speed is relevant when CPU execution and memory access dominate.
Optimizing for size is relevant when the code cache, micro-op cache, and loop buffer matter.
toml implementation idea: set game engine module dependencies as optional if possible.
Function parameters should not include two or more references of the same type, since this may cause aliasing.
Subtle but important. Remove all println! and other relative calls.
The code cache works efficiently when functions that are used near each other are also stored near each other. Collect critical parts of the code near the top of the source file and near each other.
Separate frequently called functions from seldom-used ones, and place seldom-used code such as error handling below (function order is top-to-bottom).
Pieces of data that are used together are stored near each other in memory.
Variables should be declared inside the function in which they are used; they will be stored on the stack, which is very likely to be in the level-1 cache. Avoid static and global variables.
The cache works most efficiently when data are accessed sequentially, less efficiently when accessed backward, and much less efficiently when accessed randomly. This applies to both reading and writing.
A write with a cache miss is more expensive than an uncached read, because a write causes the entire cache line (commonly 64 bytes) to be read and then written back.
Utilize out-of-order execution. Sometimes the compiler may optimize this for you, but floating-point code is almost always excluded from such optimization due to precision loss.
[Figures: example code without vs. with out-of-order execution optimization.]
Important factors to take into account for out-of-order execution:
Prevent false dependencies by writing to a full register rather than a partial register (RAX or EAX, not AX and especially not AH).
INC and DEC are inefficient on some CPUs; use ADD or SUB instead.
A chain of instructions in which each depends on the previous one can't utilize out-of-order execution.
The CPU can do more things simultaneously if the code contains a good mixture of instructions from different execution-unit categories (ALU, FPU, AGU, etc.).
[Figures: conditions that work best for auto-vectorization, and conditions under which the compiler may fail to vectorize.]
For critical functions that use multiplication, use the shift-left operator if the operand is a power of two.
You may also use shift-left if the operand isn't a power of two, but it requires an intermediate step.
This should only be used in critical sections and hot code; the compiler can usually optimize multiplication by powers of two and by 3, 5, and 9 on its own.
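Both cases as a sketch (again, compilers usually do this themselves, so reserve the explicit form for proven hot spots):

```rust
// x * 8 as a single shift (8 is a power of two).
fn times_8(x: u32) -> u32 {
    x << 3
}

// x * 6 via an intermediate step: 6x = 4x + 2x.
fn times_6(x: u32) -> u32 {
    (x << 2) + (x << 1)
}
```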
Avoid constraining the sizes of very big data structures to powers of two, to prevent very expensive cache contention.
Unsigned integer division is faster than signed, both with and without a constant divisor. Remember that converting between signed and unsigned is a relatively cheap operation. The same rules apply to modulo.
Floating-point division by a constant should be done by multiplying with the reciprocal (1 / some_constant). The reciprocal can usually be calculated at compile time, so the hot path does a multiplication rather than a division. Compilers are heavily guarded against doing this automatically due to precision loss, so it is best practice to do it explicitly.
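A sketch of the explicit reciprocal (the constant and function are illustrative; the last bits can differ from a true division, which is exactly why the compiler won't do this on its own):

```rust
// Computed once at compile time.
const INV_255: f32 = 1.0 / 255.0;

// Hot path multiplies instead of divides.
fn normalize_byte(v: u8) -> f32 {
    v as f32 * INV_255
}
```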
Conversion of signed integers to floating point is fast only when the SSE2 instruction set is enabled. Conversion of unsigned integers to floating point is fast only when the AVX-512 instruction set is enabled.
Use the vectorcall calling convention for the math library so function parameters are passed in SIMD registers.
A function should take no more than four parameters, since on Windows no more than four parameters can be transferred in registers; any further parameters are transferred on the stack. Linux, BSD, and macOS can transfer up to fourteen parameters in registers (six integer, eight floating-point), with the rest on the stack. A cap of four parameters therefore satisfies both conventions, so parameters are stored in registers rather than on the stack.
The maximum average number of μops per clock cycle is three or four on many processors. It may be possible, for example, to do an integer operation, a floating-point operation, and a memory operation in the same clock cycle. The maximum number of arithmetic operations (anything other than a memory read or write) is limited to two, three, or four μops per clock cycle, depending on the CPU; assume two by default and rarely rely on three. Floating-point and integer arithmetic can run in the same clock cycle if the instructions use different execution units, or if there are multiple instances of the same unit.
Keep loops small so they are likely to fit into the μop cache or loop buffer on CPUs that have either or both. Avoid unnecessary loop unrolling.
The optimal number of accumulators is the latency of the instruction divided by its reciprocal throughput.
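For example, summing with multiple independent accumulators (a sketch; the count of four assumes roughly 4-cycle add latency and 1-cycle reciprocal throughput):

```rust
// Four independent accumulators break the loop-carried dependency on a
// single running sum, letting additions overlap in the pipeline.
fn sum_4acc(v: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let chunks = v.chunks_exact(4);
    let tail: f32 = chunks.remainder().iter().sum();
    for c in chunks {
        acc[0] += c[0];
        acc[1] += c[1];
        acc[2] += c[2];
        acc[3] += c[3];
    }
    acc[0] + acc[1] + acc[2] + acc[3] + tail
}
```

Note that reassociating floating-point sums this way can change the result in the last bits, which is why the compiler won't do it unaided.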
Beware of compiler padding. Order fields by decreasing size where possible (largest to smallest).
Align structs so their size is divisible by a power of two.
Big arrays and big objects should come last in a structure. Keep struct sizes below 128 bytes.
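A sketch of largest-first field ordering (the struct is hypothetical; note that rustc reorders fields of plain structs on its own, so explicit ordering mainly matters for `#[repr(C)]` data shared with GPU or FFI code):

```rust
#[repr(C)] // layout exactly as written
struct Particle {
    position: [f32; 3], // 12 bytes
    velocity: [f32; 3], // 12 bytes
    lifetime: f32,      // 4 bytes
    flags: u8,          // 1 byte (+3 bytes tail padding)
}

fn particle_size() -> usize {
    std::mem::size_of::<Particle>() // 29 bytes of fields, padded to 32
}
```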
Avoid splitting speed-critical operations across separate functions, but keep rarely executed code, such as error handling, in a separate function away from the critical hot-path function.
A struct containing three or four f32 values (12 or 16 bytes) should be passed by copy as a function parameter. Anything greater than 16 bytes should be passed by reference.
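Both cases as a sketch (the types are hypothetical stand-ins for the engine's math types):

```rust
#[derive(Clone, Copy)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

// 12 bytes: passed by copy, fits in registers.
fn length_sq(v: Vec3) -> f32 {
    v.x * v.x + v.y * v.y + v.z * v.z
}

struct Mat4([f32; 16]);

// 64 bytes: larger than 16 bytes, so passed by reference.
fn trace(m: &Mat4) -> f32 {
    m.0[0] + m.0[5] + m.0[10] + m.0[15]
}
```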