Application Architecture Check-List #20

Closed
38 tasks
KDahir247 opened this issue Jun 16, 2021 · 5 comments
Assignees
Labels
core The core foundation of the application D-No issue activity Stale issue R-Design Design for application architecture R-enhancement New feature or request

Comments


KDahir247 commented Jun 16, 2021

Check List

  • Test your code to prove it is implemented correctly, both now and when future changes are made. Tests also document the code's design.

  • Remove long dependency chains and loop-carried dependency chains with high latency.

  • Assert beforehand so that later bounds checks can be elided. vec[0] + vec[1] + vec[2] + vec[3] results in four potential panic points, while asserting the length beforehand results in only one. This applies to &[T] or Vec<T>, since their size can't be determined at compile time.

  • Index the highest array element first, so the compiler can assume that the lower indices will not be out of bounds.

let good = array[4] * array[3] * array[2] * array[1] * array[0]; // one bounds check, at array[4]
let bad = array[0] * array[1] * array[2] * array[3] * array[4]; // five bounds checks
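A minimal sketch of the assertion pattern from the previous point (the slice and the function name `sum4` are illustrative):

```rust
// Asserting the length once lets the compiler elide the four individual
// bounds checks below, leaving a single potential panic point.
fn sum4(vec: &[i32]) -> i32 {
    assert!(vec.len() >= 4); // one check up front
    vec[0] + vec[1] + vec[2] + vec[3] // bounds checks elided
}

fn main() {
    let v = vec![1, 2, 3, 4, 5];
    assert_eq!(sum4(&v), 10);
}
```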
  • Keep each system one-way: it takes from one source and produces or updates another. Write your systems with single assignments when possible.

  • Stateful data and stateless systems.

  • When refactoring, keep in mind to normalize your structures.

  • Encapsulation does nothing but hide information, bugs, and the programmer's hidden assumptions.

  • Avoid generic code except where it is clearly a good choice.

  • No abstraction.

  • When passing multiple references of the same type (such as multiple arrays) to a function, copy them into local variables to avoid aliasing.

  • Try to keep modified values separate from read-only values and write-only values.

  • No Singleton

  • Make data immutable from the start and only change it to mutable if it must be.

  • Avoid if conditions, as they cause branch mispredictions and pipeline stalls. Especially avoid them in hot code.

  • Convert as much data as possible into primitive data types.

  • Use plain old data types.

  • Keep variables local to the thread to prevent false sharing.

  • No global variables.

  • Create a local copy of array-indexed values so the compiler doesn't re-fetch and re-read the data. Avoid hidden assumptions.

  • No patterns. Focus on the solution for this specific project, not a general solution for all cases.

  • Avoid dynamism.

  • Avoid complex conditional logic.

  • Try to keep system side effects free.

  • Rename variables so they indicate what they hold.

  • Try to reduce function calls.

  • No Recursion.

  • Move loops inside function calls.

  • For arrays of two or more dimensions, prefer array[i][j + n], where n is a number (probably in the same cache line, since the array is still stored in one-dimensional memory), over array[i + n][j] (probably a different cache line or main memory).
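The difference can be sketched with a small nested array (the dimensions and the function name `row_major_sum` are illustrative):

```rust
// Rust stores [[f32; 64]; 64] as one contiguous block, row by row, so
// iterating the last index in the inner loop walks memory sequentially
// and stays within cache lines. Swapping the indices (grid[j][i]) would
// stride 64 * 4 bytes per access, touching a new cache line on almost
// every step.
fn row_major_sum(grid: &[[f32; 64]; 64]) -> f32 {
    let mut sum = 0.0;
    for i in 0..64 {
        for j in 0..64 {
            sum += grid[i][j];
        }
    }
    sum
}

fn main() {
    let grid = [[1.0_f32; 64]; 64];
    assert_eq!(row_major_sum(&grid), 4096.0);
}
```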

  • Convert ..= (inclusive) ranges in iterators to .. half-open ranges: ..(some_value + 1).
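A small sketch of that rewrite (the function names are illustrative). An inclusive range must internally track whether its final value has already been yielded, which can complicate the loop the compiler sees:

```rust
// Inclusive range: the iterator remembers whether `n` itself has been
// produced yet, which can inhibit some loop optimizations.
fn sum_inclusive(n: u32) -> u32 {
    (0..=n).sum()
}

// Equivalent half-open range: same values, simpler loop structure.
fn sum_half_open(n: u32) -> u32 {
    (0..n + 1).sum()
}

fn main() {
    assert_eq!(sum_inclusive(4), sum_half_open(4)); // both 10
}
```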

  • Keep function length below 80 lines; 60 to 65 lines per function is optimal.

  • Keep a single source file under 500 lines.

  • Set a larger Vec capacity than necessary (Vec::with_capacity(target_len + some_len) rather than Vec::with_capacity(target_len)).
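A minimal sketch of that capacity rule (`build_ids`, `target_len`, and `slack` are illustrative names):

```rust
// Reserving a little more than strictly needed avoids a reallocation
// (and the copy it implies) if a few extra elements are pushed later.
fn build_ids(target_len: usize, slack: usize) -> Vec<u32> {
    let mut ids = Vec::with_capacity(target_len + slack);
    for i in 0..target_len {
        ids.push(i as u32);
    }
    ids
}

fn main() {
    let ids = build_ids(100, 28);
    assert_eq!(ids.len(), 100);
    assert!(ids.capacity() >= 128); // headroom left, no reallocation yet
}
```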

  • Variable should have the smallest possible scope.

  • Function should handle one simple task.

  • A component shouldn't be put on another component, but rather on an entity.

// Not good: SpotLight shouldn't own LightAppearance; LightAppearance and
// SpotLight should be added to the entity as separate components.
pub struct SpotLight {
    pub appearance: LightAppearance,
    pub intensity: f32,
    pub range: f32,
}
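Under that rule, the example would instead split into independent components attached to the same entity. The field layout below is an illustrative sketch, not the project's actual types:

```rust
// Good: each piece of light data is its own component. An entity that is
// a spot light carries both components, rather than SpotLight owning
// LightAppearance.
pub struct LightAppearance {
    pub color: [f32; 3],
}

pub struct SpotLight {
    pub intensity: f32,
    pub range: f32,
}

fn main() {
    // Conceptually: world.spawn((SpotLight { .. }, LightAppearance { .. }))
    let appearance = LightAppearance { color: [1.0, 1.0, 1.0] };
    let spot = SpotLight { intensity: 1.0, range: 10.0 };
    assert_eq!(spot.range, 10.0);
    assert_eq!(appearance.color[0], 1.0);
}
```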
  • Performance matters, so run benchmarks and instrumented tests. Disassemble the code to find the bottleneck.

  • We want variables to be as narrowly scoped as possible while getting the job done.

Remember

  • You can't be fast without knowing how your data is touched.
  • You can't multi-thread without knowing how your data is touched.
  • You can't off-load without knowing how your data is touched.
  • Isolate your systems and do many at once, in bursts.
  • Be explicit about reading or writing access to data.
  • If data doesn't require change make it immutable by default.
  • Data is compared by value not reference.
  • When you read data, how much of it are you actually using?
  • Threads should only access variables local to the thread, not global variables.
  • Transform only collections of homogeneous data (things that are related).
  • Avoid complex conditional logic.
  • Strive to calculate as much as possible during compilation time.
  • Keep your code readable.
  • Breaking out of a loop will usually break auto-vectorization.
  • Good architecture is as if the entire program was crafted in anticipation of it.
  • If multiple threads modify nearby data, it is faster to place that data on different cache lines to prevent costly synchronization.
  • Aliasing can only occur between references of the same compatible type.
  • Lazy evaluation can cost more than the job it is trying to avoid.
  • When optimizing, verify that the optimization will make a difference. Use tools; don't just look for a faster frame time. Look at latency, throughput, frame time, and memory usage.
  • Infrequently used code probably shouldn't be optimized much.
  • Write for correctness then optimize.
  • If you don't need a return value from a function, don't define one.
  • Assertions should catch bugs in the program itself, never user errors.
  • Steer away from complex branching and mutable state.
  • Keep local variables in a function few and small, since we want them in registers, and there is a finite number of registers.
  • Unsigned integer division is faster than signed when dividing by a constant.
  • Use signed integers when converting to floating point, rather than unsigned.
  • Use single-precision floats unless you need higher precision.
  • Avoid conversions from integer to floating point.
  • Converting between signed and unsigned integers is very fast.
  • Converting floating point to integer takes a long time unless SSE2 or later is enabled, so try to avoid it.
  • Avoid excessive branches and function calls. They cause contention in the branch target buffer, which causes mispredictions even for branches that would otherwise be predicted well. The same applies to function calls.
  • Try to keep loop scope small.
  • Prefer and strive for leaf functions (functions that make no calls) over frame functions (functions that make calls),
    especially in critical innermost loops.

Multithreading


Cost of multithreading

  • Cost of starting and stopping threads: don't spawn a thread for light tasks whose duration is smaller than the time it takes to start and stop a thread.
  • Cost of task switching: keep the number of threads with the same priority to no more than the number of CPU cores.
  • Cost of synchronization and communication: each thread should be oblivious to other threads' data.
  • Different threads need separate storage. Avoid global or shared data, and/or place each thread's data on a separate cache line (at the cost of memory usage). Shared data can cause cache contention if the threads share the same cache.
  • Multithreaded programs must use thread-safe functions. A thread-safe function should never use static variables.

Multiple CPU cores or logical processors usually share the same cache, at least at the last cache level and possibly at lower levels. The advantage of sharing a cache is that communication between threads becomes faster and threads can share the same code and read-only data. The disadvantages are that the cache fills up if the threads use different memory areas, and that there is cache contention if different threads write to the same memory area.

Read-only data can be shared between multiple threads, but modified data should be separated per thread. The easiest way is to declare a variable local to the thread and store a copy of the data in it, since each thread has its own stack. This prevents the threads from invalidating each other's caches, which would cause large delays.

Have no more threads with the same priority than the number of cores or logical processors available in the system.

A structure containing thread-specific data, with one instance per thread, should be aligned to at least the cache line size (64 bytes) to avoid multiple threads writing to the same cache line. The cache locks the currently accessed line when one core writes to the corresponding memory, so another thread would be held waiting.
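A sketch of that alignment in Rust, assuming a 64-byte cache line (`PerThread` is an illustrative name):

```rust
// Each instance is padded and aligned to 64 bytes, so two threads'
// counters can never land on the same cache line (no false sharing).
#[repr(align(64))]
struct PerThread {
    counter: u64,
}

fn main() {
    assert_eq!(std::mem::align_of::<PerThread>(), 64);
    assert_eq!(std::mem::size_of::<PerThread>(), 64); // padded to one line
    let slots = [PerThread { counter: 0 }, PerThread { counter: 1 }];
    // slots[0] and slots[1] start on different cache lines.
    assert_eq!(slots[1].counter, 1);
}
```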

Keep communication and synchronization between threads to a minimum, since all these methods are time-consuming.

Simultaneous multithreading (hyperthreading) adds more complexity and might be slower than a single thread.
Running two threads on each core makes them compete for the same resources, such as cache and execution units.
They might run at half speed due to cache eviction, cache misses, branch mispredictions, and other conflicts. Use simultaneous multithreading with great caution; otherwise use one thread per core.


Exception & Testing

Think critically about how errors are handled in your scripts. There are user errors, such as invalid input, and programmer errors, which are bugs in the source code itself.
Programming and logic errors should be handled where they occur and should crash the program rather than be propagated, while user errors should be handled gracefully and propagated at most one level (from the function that detects the error to where the user called it).

Programming errors and logic errors should be minimized by adding tests.

Floating-point and vector code will rely on propagating NAN and INF. No exceptions.
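A small illustration of that propagation with Rust's IEEE 754 floats:

```rust
fn main() {
    // NAN propagates through arithmetic instead of raising an exception.
    let nan = f32::NAN;
    assert!((nan + 1.0).is_nan());
    assert!((nan * 0.0).is_nan());

    // INF follows IEEE 754 rules as well.
    let inf = f32::INFINITY;
    assert_eq!(1.0 / inf, 0.0);

    // Comparisons with NAN are always false, so test with is_nan(),
    // never with ==.
    assert!(nan != nan);
}
```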

Performance

Optimizing for speed is relevant when CPU and memory access dominate.

Optimizing for size is relevant when the code cache, micro-op cache, and loop buffer dominate.

Cargo.toml implementation idea

[profile.release]
debug = false
lto = true
codegen-units = 1
opt-level = "s"  # not "z", since that disables auto-vectorization; "s" is roughly level 2 with a focus on reducing size
panic = "abort"
strip = true

Set game engine module dependencies as optional if possible.

Function parameters should not include two or more references of the same type; this can cause aliasing.

Subtle but important: remove all println! and similar calls from release code.

The code cache works efficiently when functions that are used near each other are also stored near each other. Collect the critical parts of the code near the top of the source file, close to each other. Separate frequently called functions from seldom-used ones, and put seldom-used code such as error handling at the bottom.

Function order (top-bottom)

  1. Critical functions
  2. Frequently called functions
  3. Seldom-called functions
  4. Misc functions (error handling, etc...)

Pieces of data that are used together are stored near each other in memory.
Variables should be declared inside the function in which they are used. They will be stored on the stack, which is very likely to be in the level-1 cache. Avoid static and global variables.

The cache works most efficiently when data is accessed sequentially, less efficiently when data is accessed backward, and much less efficiently when data is accessed randomly. This applies to both reading and writing data.

A write with a cache miss is more expensive than an uncached read, because a write causes the entire cache line (commonly 64 bytes) to be read and then written back.

Utilize out-of-order execution. The compiler may sometimes optimize this for you, but floating-point code is almost always excluded from this optimization, because reassociating floating-point operations changes precision.

No out-of-order execution optimization

// Bad: the additions form one long dependency chain; each add must wait
// for the result of the previous one.
pub fn main() {
    let x = 1.0;
    let y = 2.0;
    let z = 3.0;
    let w = 4.0;

    let res = x + y + z + w; // evaluated as ((x + y) + z) + w
}

out-of-order execution optimization

pub fn main() {
    let x = 1.0;
    let y = 2.0;
    let z = 3.0;
    let w = 4.0;

    // (x + y) and (z + w) have no dependency on each other,
    // so the CPU can compute them in parallel.
    let res = (x + y) + (z + w);
}

Important factors to take into account for out-of-order execution:

  • Prevent false dependencies by writing to a full register rather than a partial register (RAX or EAX, not AX, and especially not AH).

  • INC and DEC are inefficient on some CPUs; use ADD or SUB instead.

  • A chain of instructions in which each depends on the previous one can't utilize out-of-order execution.

  • The CPU can do more things simultaneously if the code contains a good mixture of instructions for different categories of execution unit (ALU, FPU, AGU, etc.).

Conditions that work best for auto-vectorization:

  1. Align arrays and big structures to 16 bytes for SSE, 32 for AVX, and preferably 64 for AVX512.
  2. The loop count should preferably be constant and divisible by the number of elements in the vector.
  3. Minimize the use of branches.

The compiler may fail to vectorize the code when:

  1. It cannot rule out that not-taken branches generate exceptions or other side effects.
  2. It doesn't know whether the array size is a multiple of the vector size.
  3. It doesn't know whether the data structure is properly aligned.
  4. The data needs re-arrangement to fit into vectors.
  5. The code is too complex.
  6. The code calls an external function that has no vector version.

For critical functions that use multiplication, use the shift-left operator if the operand is a power of two. If the operand isn't a power of two, shifts can still be used with an intermediate step.

pub fn main() {
    let a = 5;
    let res = a * 4;
    let optimized_res = a << 2; // a * 4 == a * 2^2
    assert_eq!(res, optimized_res);

    let res = a * 7;
    // 7 isn't a power of two, but 7 == 8 - 1, so a * 7 == (a << 3) - a:
    // one shift plus one subtraction instead of a multiply. Note the
    // parentheses: `<<` binds more loosely than `-` in Rust.
    let optimized_res = (a << 3) - a;
    assert_eq!(res, optimized_res);
}

This solution should only be used in critical sections and hot code. The compiler can usually optimize multiplication by powers of two, and by 3, 5, and 9, on its own.

Avoid constraining very big data structures to power-of-two sizes, to prevent very expensive cache contention.

Unsigned integer division is faster than signed, whether or not the divisor is a constant. Remember that converting between signed and unsigned is a relatively cheap operation. The same rules apply to modulo.
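A sketch of exploiting that: when a signed value is known to be non-negative, cast it to unsigned before dividing by a constant (`bucket` and the divisor 7 are illustrative):

```rust
// The cast is essentially free, and unsigned division by a constant
// compiles to a cheaper strength-reduced sequence than the signed form,
// which must also handle negative operands.
fn bucket(index: i32) -> u32 {
    debug_assert!(index >= 0); // caller guarantees non-negative
    (index as u32) / 7
}

fn main() {
    assert_eq!(bucket(21), 3);
    assert_eq!(bucket(6), 0);
}
```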

Floating-point division by a constant should be done by multiplying with the reciprocal. The constant (1 / some_constant) can usually be evaluated at compile time, so only a multiplication is performed rather than a division. Compilers are usually heavily guarded against making this transformation themselves, due to precision loss, so it is best practice to do it explicitly.

pub fn main() {
    let a: f32 = 5.0;
    let res: f32 = a / 1.24816; // not great: performs a division
    let res_optimized = a * (1.0 / 1.24816); // better: the reciprocal is evaluated at compile time
    let res_best = a * 0.801179335; // best: just precompute the reciprocal yourself
}

Conversion of signed integers to floating point is fast only when the SSE2 instruction set is enabled. Conversion of unsigned integers to floating point is fast only when the AVX512 instruction set is enabled.

Use the vectorcall calling convention for the math library so function parameters are stored in SIMD registers.

Functions should take no more than four parameters, since on Windows no more than four parameters can be transferred in registers; any further parameters are transferred on the stack. Linux, BSD, and macOS can transfer up to fourteen parameters in registers (six integers, eight floating point); any more are transferred on the stack. A cap of four parameters satisfies both Windows and the Unix-like systems, so function parameters stay in registers rather than in stack space.

The maximum average number of μops per clock cycle is three or four on many processors. It may be possible, for example, to do an integer operation, a floating-point operation, and a memory operation in the same clock cycle. The maximum number of arithmetic operations (i.e. anything other than memory reads or writes) is limited to two, three, or four μops per clock cycle, depending on the CPU. Default to two and rarely assume three. Floating-point and integer arithmetic can execute in the same clock cycle if the instructions use different execution units, and if there are multiple instances of the same execution unit.

Keep loops small so they are likely to fit into the μop cache or loop buffer on CPUs that have either. Avoid unnecessary loop unrolling.

The optimal number of accumulators is the instruction's latency divided by its reciprocal throughput.
E.g.

pub fn main() {
    let mut sum0 = 0.0;
    let mut sum1 = 0.0;
    let mut sum2 = 0.0;
    let mut sum3 = 0.0;
    let array = [1.0; 100]; // pretend these are arbitrary values, not all ones

    // addss (add scalar single-precision float):
    // if its latency is 4 and its reciprocal throughput is 1,
    // the optimal number of accumulators is 4 / 1 = 4.
    for index in (0..100).step_by(4) {
        // Index [index + 3] first: the compiler can then elide the bounds
        // checks for the lower indices, leaving one check per iteration.
        sum3 += array[index + 3];
        sum2 += array[index + 2];
        sum1 += array[index + 1];
        sum0 += array[index];
    }

    let sum_result = (sum0 + sum1) + (sum2 + sum3);
}

Beware of compiler padding. Order fields by decreasing size where possible (high to low).
Align structs so their size is divisible by a power of two.
Big arrays and big objects should come last in the structure. Keep struct size below 128 bytes.

// Field order matters under #[repr(C)]; note that Rust's default repr is
// free to reorder fields on its own.
#[repr(C)]
struct User1 {
    arg0: u32, // 4 bytes
    // 4 bytes of padding
    arg1: u64, // 8 bytes
    arg2: u8,  // 1 byte
    // 7 bytes of padding
    arg3: u64, // 8 bytes
} // total: 32 bytes

vs

#[repr(C)]
struct User {
    arg3: u64, // 8 bytes
    arg1: u64, // 8 bytes
    arg0: u32, // 4 bytes
    arg2: u8,  // 1 byte
    // 3 bytes of padding
} // total: 24 bytes

Avoid splitting speed-critical operations across different functions. Keep rarely accessed code, such as error handling, in a separate function away from the critical hot-path function.

Structs containing three or four f32 values (12 or 16 bytes) should be passed by copy as function parameters. Anything greater than 16 bytes should be passed by reference.
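A sketch of that threshold (`Vec3` and `Mat4` are illustrative types, not the engine's actual math library):

```rust
#[derive(Clone, Copy)]
pub struct Vec3 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
} // 12 bytes: pass by copy

pub struct Mat4 {
    pub cols: [[f32; 4]; 4],
} // 64 bytes: pass by reference

// 12 bytes fits in registers; copying is cheaper than indirection.
pub fn length_squared(v: Vec3) -> f32 {
    v.x * v.x + v.y * v.y + v.z * v.z
}

// 64 bytes is well over the 16-byte threshold, so take a reference.
pub fn trace(m: &Mat4) -> f32 {
    m.cols[0][0] + m.cols[1][1] + m.cols[2][2] + m.cols[3][3]
}

fn main() {
    let v = Vec3 { x: 1.0, y: 2.0, z: 2.0 };
    assert_eq!(length_squared(v), 9.0);
    let mut m = Mat4 { cols: [[0.0; 4]; 4] };
    for i in 0..4 {
        m.cols[i][i] = 1.0; // identity matrix
    }
    assert_eq!(trace(&m), 4.0);
}
```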

@KDahir247 KDahir247 added R-enhancement New feature or request R-Design Design for application architecture core The core foundation of the application labels Jun 16, 2021
@KDahir247 KDahir247 self-assigned this Jun 16, 2021
@KDahir247
Owner Author

Please leave a comment if I am missing anything to add on or if a piece of my architecture vision is flawed.

@KDahir247 KDahir247 pinned this issue Jun 16, 2021
@KDahir247
Owner Author

I will be looking at this checklist after I complete each module in the game engine.

@github-actions

Stale issue message

@github-actions github-actions bot added the D-No issue activity Stale issue label Jan 20, 2022
@KDahir247 KDahir247 reopened this Jan 27, 2022
@KDahir247
Owner Author

I will create a vision paper for the game engine using this.
It will be cleaner and will be modularized.
I will follow the vision paper when crafting the game engine.

@KDahir247 KDahir247 reopened this Sep 25, 2022
@github-actions github-actions bot removed the D-No issue activity Stale issue label Sep 26, 2022
@github-actions

Stale issue message

@github-actions github-actions bot added the D-No issue activity Stale issue label Jan 21, 2023