Application Architecture Check-List #20

Closed
38 tasks
KDahir247 opened this issue Jun 16, 2021 · 5 comments
Assignees
Labels
core The core foundation of the application D-No issue activity Stale issue R-Design Design for application architecture R-enhancement New feature or request

Comments


KDahir247 commented Jun 16, 2021

Check List

  • Test your code to prove it is implemented correctly, both now and when future changes are made. Tests also document the code's design.

  • Remove long dependency chains and loop-carried dependency chains with high latency.

  • Assert beforehand so that later bounds checks can be elided. vec[0] + vec[1] + vec[2] + vec[3] results in four potential panic points, while asserting the length beforehand results in only one. This applies to &[T] or Vec<T>, since their size can't be determined at compile time.

  • Index the highest array element first, so the compiler can assume that the lower indices will not be out of bounds.

let good = array[4] * array[3] * array[2] * array[1] * array[0]; // one bounds check, at array[4]
let bad = array[0] * array[1] * array[2] * array[3] * array[4]; // five bounds checks
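A minimal sketch of the assertion pattern from the previous point (the slice and the function name `sum4` are illustrative):

```rust
// Asserting the length once lets the compiler elide the four individual
// bounds checks below, leaving a single potential panic point.
fn sum4(vec: &[i32]) -> i32 {
    assert!(vec.len() >= 4); // one check up front
    vec[0] + vec[1] + vec[2] + vec[3] // bounds checks elided
}

fn main() {
    let v = vec![1, 2, 3, 4, 5];
    assert_eq!(sum4(&v), 10);
}
```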
  • Keep each system one-way: it takes from one source and produces or updates another. Write your systems with single assignments when possible.

  • Stateful data and stateless systems.

  • When refactoring, keep in mind to normalize your structures.

  • Encapsulation does nothing but hide information, bugs, and the programmer's hidden assumptions.

  • Avoid generic code except where it is clearly a good choice.

  • No abstraction.

  • When passing multiple references of the same type (such as multiple arrays) to a function, copy them into local variables to avoid aliasing.

  • Try to keep modified values separate from read-only values and write-only values.

  • No Singleton

  • Make data immutable from the start and only change it to mutable if it must be.

  • Avoid if conditions, as they cause branch mispredictions and pipeline stalls. Especially avoid them in hot code.

  • Convert as much data as possible into primitive data types.

  • Use plain old data types.

  • Keep variables local to the thread to prevent false sharing.

  • No global variables.

  • Create a local copy of array-indexed values so the compiler doesn't re-fetch and re-read the data. Avoid hidden assumptions.

  • No patterns. Focus on the solution for this specific project, not a general solution for all cases.

  • Avoid dynamism.

  • Avoid complex conditional logic.

  • Try to keep system side effects free.

  • Rename variables so they indicate what they hold.

  • Try to reduce function calls.

  • No Recursion.

  • Move loops inside function calls.

  • For arrays of two or more dimensions, prefer array[i][j + n], where n is a number (probably in the same cache line, since the array is still stored in one-dimensional memory), over array[i + n][j] (probably a different cache line or main memory).
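The difference can be sketched with a small nested array (the dimensions and the function name `row_major_sum` are illustrative):

```rust
// Rust stores [[f32; 64]; 64] as one contiguous block, row by row, so
// iterating the last index in the inner loop walks memory sequentially
// and stays within cache lines. Swapping the indices (grid[j][i]) would
// stride 64 * 4 bytes per access, touching a new cache line on almost
// every step.
fn row_major_sum(grid: &[[f32; 64]; 64]) -> f32 {
    let mut sum = 0.0;
    for i in 0..64 {
        for j in 0..64 {
            sum += grid[i][j];
        }
    }
    sum
}

fn main() {
    let grid = [[1.0_f32; 64]; 64];
    assert_eq!(row_major_sum(&grid), 4096.0);
}
```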

  • Convert ..= (inclusive) ranges in iterators to .. half-open ranges: ..(some_value + 1).
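A small sketch of that rewrite (the function names are illustrative). An inclusive range must internally track whether its final value has already been yielded, which can complicate the loop the compiler sees:

```rust
// Inclusive range: the iterator remembers whether `n` itself has been
// produced yet, which can inhibit some loop optimizations.
fn sum_inclusive(n: u32) -> u32 {
    (0..=n).sum()
}

// Equivalent half-open range: same values, simpler loop structure.
fn sum_half_open(n: u32) -> u32 {
    (0..n + 1).sum()
}

fn main() {
    assert_eq!(sum_inclusive(4), sum_half_open(4)); // both 10
}
```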

  • Keep function length below 80 lines; 60 to 65 lines per function is optimal.

  • Keep a single source file under 500 lines.

  • Set a larger Vec capacity than necessary (Vec::with_capacity(target_len + some_len) rather than Vec::with_capacity(target_len)).
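A minimal sketch of that capacity rule (`build_ids`, `target_len`, and `slack` are illustrative names):

```rust
// Reserving a little more than strictly needed avoids a reallocation
// (and the copy it implies) if a few extra elements are pushed later.
fn build_ids(target_len: usize, slack: usize) -> Vec<u32> {
    let mut ids = Vec::with_capacity(target_len + slack);
    for i in 0..target_len {
        ids.push(i as u32);
    }
    ids
}

fn main() {
    let ids = build_ids(100, 28);
    assert_eq!(ids.len(), 100);
    assert!(ids.capacity() >= 128); // headroom left, no reallocation yet
}
```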

  • Variable should have the smallest possible scope.

  • Function should handle one simple task.

  • A component shouldn't be put on another component, but rather on an entity.

// Not good: SpotLight shouldn't own LightAppearance; LightAppearance and
// SpotLight should be added to the entity as separate components.
pub struct SpotLight {
    pub appearance: LightAppearance,
    pub intensity: f32,
    pub range: f32,
}
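Under that rule, the example would instead split into independent components attached to the same entity. The field layout below is an illustrative sketch, not the project's actual types:

```rust
// Good: each piece of light data is its own component. An entity that is
// a spot light carries both components, rather than SpotLight owning
// LightAppearance.
pub struct LightAppearance {
    pub color: [f32; 3],
}

pub struct SpotLight {
    pub intensity: f32,
    pub range: f32,
}

fn main() {
    // Conceptually: world.spawn((SpotLight { .. }, LightAppearance { .. }))
    let appearance = LightAppearance { color: [1.0, 1.0, 1.0] };
    let spot = SpotLight { intensity: 1.0, range: 10.0 };
    assert_eq!(spot.range, 10.0);
    assert_eq!(appearance.color[0], 1.0);
}
```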
  • Performance matters, so run benchmarks and instrumented tests. Disassemble the code to find the bottleneck.

  • We want variables to be as narrowly scoped as possible while getting the job done.

Remember

  • You can't be fast without knowing how your data is touched.
  • You can't multi-thread without knowing how your data is touched.
  • You can't off-load without knowing how your data is touched.
  • Isolate your systems and do many at once, in bursts.
  • Be explicit about reading or writing access to data.
  • If data doesn't require change make it immutable by default.
  • Data is compared by value not reference.
  • When you read data, how much of it are you actually using?
  • Threads should only access variables local to the thread, not global variables.
  • Transform only collections of homogeneous data (things that are related).
  • Avoid complex conditional logic.
  • Strive to calculate as much as possible during compilation time.
  • Keep your code readable.
  • Breaking out of a loop will usually break auto-vectorization.
  • Good architecture is as if the entire program was crafted in anticipation of it.
  • If multiple threads modify nearby data, it is faster to place that data on different cache lines to prevent costly synchronization.
  • Aliasing can only occur between references of the same compatible type.
  • Lazy evaluation can cost more than the job it is trying to avoid.
  • When optimizing, verify that the optimization will make a difference. Use tools; don't just look for a faster frame time. Look at latency, throughput, frame time, and memory usage.
  • Infrequently used code probably shouldn't be optimized much.
  • Write for correctness then optimize.
  • If you don't need a return value from a function, don't define one.
  • Assertions should catch bugs in the program itself, never user errors.
  • Steer away from complex branching and mutable state.
  • Keep local variables in a function few and small, since we want them in registers, and there is a finite number of registers.
  • Unsigned integer division is faster than signed when dividing by a constant.
  • Use signed integers when converting to floating point, rather than unsigned.
  • Use single-precision floats unless you need higher precision.
  • Avoid conversions from integer to floating point.
  • Converting between signed and unsigned integers is very fast.
  • Converting floating point to integer takes a long time unless SSE2 or later is enabled, so try to avoid it.
  • Avoid excessive branches and function calls. They cause contention in the branch target buffer, which causes mispredictions even for branches that would otherwise be predicted well. The same applies to function calls.
  • Try to keep loop scope small.
  • Prefer and strive for leaf functions (functions that make no calls) over frame functions (functions that make calls),
    especially in critical innermost loops.

Multithreading


Cost of multithreading

  • Cost of starting and stopping threads: don't spawn a thread for light tasks whose duration is smaller than the time it takes to start and stop a thread.
  • Cost of task switching: keep the number of threads with the same priority to no more than the number of CPU cores.
  • Cost of synchronization and communication: each thread should be oblivious to other threads' data.
  • Different threads need separate storage. Avoid global or shared data, and/or place each thread's data on a separate cache line (at the cost of memory usage). Shared data can cause cache contention if the threads share the same cache.
  • Multithreaded programs must use thread-safe functions. A thread-safe function should never use static variables.

Multiple CPU cores or logical processors usually share the same cache, at least at the last cache level and possibly at lower levels. The advantage of sharing a cache is that communication between threads becomes faster and threads can share the same code and read-only data. The disadvantages are that the cache fills up if the threads use different memory areas, and that there is cache contention if different threads write to the same memory area.

Read-only data can be shared between multiple threads, but modified data should be separated per thread. The easiest way is to declare a variable local to the thread and store a copy of the data in it, since each thread has its own stack. This prevents the threads from invalidating each other's caches, which would cause large delays.

Have no more threads with the same priority than the number of cores or logical processors available in the system.

A structure containing thread-specific data, with one instance per thread, should be aligned to at least the cache line size (64 bytes) to avoid multiple threads writing to the same cache line. The cache locks the currently accessed line when one core writes to the corresponding memory, so another thread would be held waiting.
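A sketch of that alignment in Rust, assuming a 64-byte cache line (`PerThread` is an illustrative name):

```rust
// Each instance is padded and aligned to 64 bytes, so two threads'
// counters can never land on the same cache line (no false sharing).
#[repr(align(64))]
struct PerThread {
    counter: u64,
}

fn main() {
    assert_eq!(std::mem::align_of::<PerThread>(), 64);
    assert_eq!(std::mem::size_of::<PerThread>(), 64); // padded to one line
    let slots = [PerThread { counter: 0 }, PerThread { counter: 1 }];
    // slots[0] and slots[1] start on different cache lines.
    assert_eq!(slots[1].counter, 1);
}
```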

Keep communication and synchronization between threads to a minimum, since all these methods are time-consuming.

Simultaneous multithreading (hyperthreading) adds more complexity and might be slower than a single thread.
Running two threads on each core makes them compete for the same resources, such as cache and execution units.
They might run at half speed due to cache eviction, cache misses, branch mispredictions, and other conflicts. Use simultaneous multithreading with great caution; otherwise use one thread per core.


Exception & Testing

Think critically about how errors are handled in your scripts. There are user errors, such as invalid input, and programmer errors, which are bugs in the source code itself.
Programming and logic errors should be handled where they occur and should crash the program rather than be propagated, while user errors should be handled gracefully and propagated at most one level (from the function that detects the error to where the user called it).

Programming errors and logic errors should be minimized by adding tests.

Floating-point and vector code will rely on propagating NAN and INF. No exceptions.
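A small illustration of that propagation with Rust's IEEE 754 floats:

```rust
fn main() {
    // NAN propagates through arithmetic instead of raising an exception.
    let nan = f32::NAN;
    assert!((nan + 1.0).is_nan());
    assert!((nan * 0.0).is_nan());

    // INF follows IEEE 754 rules as well.
    let inf = f32::INFINITY;
    assert_eq!(1.0 / inf, 0.0);

    // Comparisons with NAN are always false, so test with is_nan(),
    // never with ==.
    assert!(nan != nan);
}
```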

Performance

Optimizing for speed is relevant when CPU and memory access dominate.

Optimizing for size is relevant when the code cache, micro-op cache, and loop buffer dominate.

Cargo.toml implementation idea

[profile.release]
debug = false
lto = true
codegen-units = 1
opt-level = "s"  # not "z", since that disables auto-vectorization; "s" is roughly level 2 with a focus on reducing size
panic = "abort"
strip = true

Set game engine module dependencies as optional if possible.

Function parameters should not include two or more references of the same type; this can cause aliasing.

Subtle but important: remove all println! and similar calls from release code.

The code cache works efficiently when functions that are used near each other are also stored near each other. Collect the critical parts of the code near the top of the source file, close to each other. Separate frequently called functions from seldom-used ones, and put seldom-used code such as error handling at the bottom.

Function order (top-bottom)

  1. Critical functions
  2. Frequently called functions
  3. Seldom-called functions
  4. Misc functions (error handling, etc...)

Pieces of data that are used together are stored near each other in memory.
Variables should be declared inside the function in which they are used. They will be stored on the stack, which is very likely to be in the level-1 cache. Avoid static and global variables.

The cache works most efficiently when data is accessed sequentially, less efficiently when data is accessed backward, and much less efficiently when data is accessed randomly. This applies to both reading and writing data.

A write with a cache miss is more expensive than an uncached read, because a write causes the entire cache line (commonly 64 bytes) to be read and then written back.

Utilize out-of-order execution. The compiler may sometimes optimize this for you, but floating-point code is almost always excluded from this optimization, because reassociating floating-point operations changes precision.

No out-of-order execution optimization

// Bad: the additions form one long dependency chain; each add must wait
// for the result of the previous one.
pub fn main() {
    let x = 1.0;
    let y = 2.0;
    let z = 3.0;
    let w = 4.0;

    let res = x + y + z + w; // evaluated as ((x + y) + z) + w
}

out-of-order execution optimization

pub fn main() {
    let x = 1.0;
    let y = 2.0;
    let z = 3.0;
    let w = 4.0;

    // (x + y) and (z + w) have no dependency on each other,
    // so the CPU can compute them in parallel.
    let res = (x + y) + (z + w);
}

Important factors to take into account for out-of-order execution:

  • Prevent false dependencies by writing to a full register rather than a partial register (RAX or EAX, not AX, and especially not AH).

  • INC and DEC are inefficient on some CPUs; use ADD or SUB instead.

  • A chain of instructions in which each depends on the previous one can't utilize out-of-order execution.

  • The CPU can do more things simultaneously if the code contains a good mixture of instructions for different categories of execution unit (ALU, FPU, AGU, etc.).

Conditions that work best for auto-vectorization:

  1. Align arrays and big structures to 16 bytes for SSE, 32 for AVX, and preferably 64 for AVX512.
  2. The loop count should preferably be constant and divisible by the number of elements in the vector.
  3. Minimize the use of branches.

The compiler may fail to vectorize the code when:

  1. It cannot rule out that not-taken branches generate exceptions or other side effects.
  2. It doesn't know whether the array size is a multiple of the vector size.
  3. It doesn't know whether the data structure is properly aligned.
  4. The data needs re-arrangement to fit into vectors.
  5. The code is too complex.
  6. The code calls an external function that has no vector version.

For critical functions that use multiplication, use the shift-left operator if the operand is a power of two. If the operand isn't a power of two, shifts can still be used with an intermediate step.

pub fn main() {
    let a = 5;
    let res = a * 4;
    let optimized_res = a << 2; // a * 4 == a * 2^2
    assert_eq!(res, optimized_res);

    let res = a * 7;
    // 7 isn't a power of two, but 7 == 8 - 1, so a * 7 == (a << 3) - a:
    // one shift plus one subtraction instead of a multiply. Note the
    // parentheses: `<<` binds more loosely than `-` in Rust.
    let optimized_res = (a << 3) - a;
    assert_eq!(res, optimized_res);
}

This solution should only be used in critical sections and hot code. The compiler can usually optimize multiplication by powers of two, and by 3, 5, and 9, on its own.

Avoid constraining very big data structures to power-of-two sizes, to prevent very expensive cache contention.

Unsigned integer division is faster than signed, whether or not the divisor is a constant. Remember that converting between signed and unsigned is a relatively cheap operation. The same rules apply to modulo.
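A sketch of exploiting that: when a signed value is known to be non-negative, cast it to unsigned before dividing by a constant (`bucket` and the divisor 7 are illustrative):

```rust
// The cast is essentially free, and unsigned division by a constant
// compiles to a cheaper strength-reduced sequence than the signed form,
// which must also handle negative operands.
fn bucket(index: i32) -> u32 {
    debug_assert!(index >= 0); // caller guarantees non-negative
    (index as u32) / 7
}

fn main() {
    assert_eq!(bucket(21), 3);
    assert_eq!(bucket(6), 0);
}
```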

Floating-point division by a constant should be done by multiplying with the reciprocal. The constant (1 / some_constant) can usually be evaluated at compile time, so only a multiplication is performed rather than a division. Compilers are usually heavily guarded against making this transformation themselves, due to precision loss, so it is best practice to do it explicitly.

pub fn main() {
    let a: f32 = 5.0;
    let res: f32 = a / 1.24816; // not great: performs a division
    let res_optimized = a * (1.0 / 1.24816); // better: the reciprocal is evaluated at compile time
    let res_best = a * 0.801179335; // best: just precompute the reciprocal yourself
}

Conversion of signed integers to floating point is fast only when the SSE2 instruction set is enabled. Conversion of unsigned integers to floating point is fast only when the AVX512 instruction set is enabled.

Use the vectorcall calling convention for the math library so function parameters are stored in SIMD registers.

Functions should take no more than four parameters, since on Windows no more than four parameters can be transferred in registers; any further parameters are transferred on the stack. Linux, BSD, and macOS can transfer up to fourteen parameters in registers (six integers, eight floating point); any more are transferred on the stack. A cap of four parameters satisfies both Windows and the Unix-like systems, so function parameters stay in registers rather than in stack space.

The maximum average number of μops per clock cycle is three or four on many processors. It may be possible, for example, to do an integer operation, a floating-point operation, and a memory operation in the same clock cycle. The maximum number of arithmetic operations (i.e. anything other than memory reads or writes) is limited to two, three, or four μops per clock cycle, depending on the CPU. Default to two and rarely assume three. Floating-point and integer arithmetic can execute in the same clock cycle if the instructions use different execution units, and if there are multiple instances of the same execution unit.

Keep loops small so they are likely to fit into the μop cache or loop buffer on CPUs that have either. Avoid unnecessary loop unrolling.

The optimal number of accumulators is the instruction's latency divided by its reciprocal throughput.
E.g.

pub fn main() {
    let mut sum0 = 0.0;
    let mut sum1 = 0.0;
    let mut sum2 = 0.0;
    let mut sum3 = 0.0;
    let array = [1.0; 100]; // pretend these are arbitrary values, not all ones

    // addss (add scalar single-precision float):
    // if its latency is 4 and its reciprocal throughput is 1,
    // the optimal number of accumulators is 4 / 1 = 4.
    for index in (0..100).step_by(4) {
        // Index [index + 3] first: the compiler can then elide the bounds
        // checks for the lower indices, leaving one check per iteration.
        sum3 += array[index + 3];
        sum2 += array[index + 2];
        sum1 += array[index + 1];
        sum0 += array[index];
    }

    let sum_result = (sum0 + sum1) + (sum2 + sum3);
}

Beware of compiler padding. Order fields by decreasing size where possible (high to low).
Align structs so their size is divisible by a power of two.
Big arrays and big objects should come last in the structure. Keep struct size below 128 bytes.

// Field order matters under #[repr(C)]; note that Rust's default repr is
// free to reorder fields on its own.
#[repr(C)]
struct User1 {
    arg0: u32, // 4 bytes
    // 4 bytes of padding
    arg1: u64, // 8 bytes
    arg2: u8,  // 1 byte
    // 7 bytes of padding
    arg3: u64, // 8 bytes
} // total: 32 bytes

vs

#[repr(C)]
struct User {
    arg3: u64, // 8 bytes
    arg1: u64, // 8 bytes
    arg0: u32, // 4 bytes
    arg2: u8,  // 1 byte
    // 3 bytes of padding
} // total: 24 bytes

Avoid splitting speed-critical operations across different functions. Keep rarely accessed code, such as error handling, in a separate function away from the critical hot-path function.

Structs containing three or four f32 values (12 or 16 bytes) should be passed by copy as function parameters. Anything greater than 16 bytes should be passed by reference.
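A sketch of that threshold (`Vec3` and `Mat4` are illustrative types, not the engine's actual math library):

```rust
#[derive(Clone, Copy)]
pub struct Vec3 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
} // 12 bytes: pass by copy

pub struct Mat4 {
    pub cols: [[f32; 4]; 4],
} // 64 bytes: pass by reference

// 12 bytes fits in registers; copying is cheaper than indirection.
pub fn length_squared(v: Vec3) -> f32 {
    v.x * v.x + v.y * v.y + v.z * v.z
}

// 64 bytes is well over the 16-byte threshold, so take a reference.
pub fn trace(m: &Mat4) -> f32 {
    m.cols[0][0] + m.cols[1][1] + m.cols[2][2] + m.cols[3][3]
}

fn main() {
    let v = Vec3 { x: 1.0, y: 2.0, z: 2.0 };
    assert_eq!(length_squared(v), 9.0);
    let mut m = Mat4 { cols: [[0.0; 4]; 4] };
    for i in 0..4 {
        m.cols[i][i] = 1.0; // identity matrix
    }
    assert_eq!(trace(&m), 4.0);
}
```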

@KDahir247 KDahir247 added R-enhancement New feature or request R-Design Design for application architecture core The core foundation of the application labels Jun 16, 2021
@KDahir247 KDahir247 self-assigned this Jun 16, 2021
@KDahir247
Owner Author

Please leave a comment if I am missing anything to add on or if a piece of my architecture vision is flawed.

@KDahir247 KDahir247 pinned this issue Jun 16, 2021
@KDahir247
Owner Author

I will be looking at this checklist after I complete each module in the game engine.

@github-actions

Stale issue message

@github-actions github-actions bot added the D-No issue activity Stale issue label Jan 20, 2022
@KDahir247 KDahir247 reopened this Jan 27, 2022
@KDahir247
Owner Author

I will create a vision paper for the game engine using this.
It will be cleaner and will be modularized.
I will follow the vision paper when crafting the game engine.

@KDahir247 KDahir247 reopened this Sep 25, 2022
@github-actions github-actions bot removed the D-No issue activity Stale issue label Sep 26, 2022
@github-actions

Stale issue message

@github-actions github-actions bot added the D-No issue activity Stale issue label Jan 21, 2023