MSDN Docs

Performance Best Practices

Profiling and Measurement

Effective performance optimization starts with accurate measurement. Use the following tools to profile your application:

  • Windows Performance Analyzer (WPA)
  • Visual Studio Profiler
  • ETW (Event Tracing for Windows)

Collect data from release builds in a production‑like environment, and focus first on the calls that account for the largest share of measured time (for example, the top 5%).
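
Before opening a full trace, a coarse manual timer can help confirm which region deserves profiling. The following is a minimal sketch in portable C++, not part of any of the tools listed above; the names MeasureMicroseconds and BusyWork are hypothetical and exist only for illustration.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs `fn` once and returns the elapsed time in
// microseconds. steady_clock is monotonic, so adjustments to the system
// clock cannot produce a negative interval.
template <typename Fn>
std::int64_t MeasureMicroseconds(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Hypothetical workload that gives the timer something to measure:
// the sum of i*i for i in [0, n).
inline std::uint64_t BusyWork(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        sum += i * i;
    }
    return sum;
}
```

A single run is noisy. In practice the measurement would be repeated and the minimum or median kept, and the result treated as a hint about where to point the profiler rather than as a replacement for it.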

CPU Usage

Reduce CPU cycles by:

  1. Avoiding unnecessary recomputation – cache results when possible.
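  2. Using SIMD intrinsics (#include <immintrin.h>) for data‑parallel work.
  3. Preferring stack allocation over heap for small, short‑lived objects.

#include <immintrin.h>
#include <stddef.h>

// Adds two float arrays element-wise using 256-bit AVX registers.
// Requires AVX support in the CPU and compiler (for example, /arch:AVX).
void AddVectors(const float* a, const float* b, float* c, size_t n) {
    size_t i = 0;
    for (; i + 7 < n; i += 8) {
        // Unaligned loads/stores: the pointers need not be 32-byte aligned.
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    // Handle the remaining 0-7 elements with scalar code.
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}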
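
The first item in the list above, caching results to avoid recomputation, can be sketched as follows. This is an illustrative example under simple assumptions (a pure function, a single thread, an unbounded cache); the class name CachedSquareSum and its members are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical cache wrapper: remembers the result of an expensive, pure
// function so that repeated calls with the same argument are answered
// from the table instead of being recomputed. Not thread-safe.
class CachedSquareSum {
public:
    std::uint64_t Get(std::uint32_t n) {
        const auto it = cache_.find(n);
        if (it != cache_.end()) {
            return it->second;                   // cache hit: no recomputation
        }
        const std::uint64_t value = Compute(n);  // cache miss: do the work once
        cache_.emplace(n, value);
        return value;
    }

    // Number of times the expensive path actually ran.
    std::size_t ComputeCount() const { return computeCount_; }

private:
    // Stand-in for an expensive computation: the sum of i*i for i in [1, n].
    std::uint64_t Compute(std::uint32_t n) {
        ++computeCount_;
        std::uint64_t sum = 0;
        for (std::uint64_t i = 1; i <= n; ++i) {
            sum += i * i;
        }
        return sum;
    }

    std::unordered_map<std::uint32_t, std::uint64_t> cache_;
    std::size_t computeCount_ = 0;
};
```

Caching pays off only when the lookup is cheaper than the computation and the same inputs recur. A cache that grows without bound trades CPU time for memory, so a real implementation would usually cap its size.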

I/O Optimization

For file and network I/O:

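  • Use overlapped (asynchronous) I/O: open the handle with FILE_FLAG_OVERLAPPED and pass an OVERLAPPED structure to ReadFile/WriteFile, or use ReadFileEx/WriteFileEx with completion routines.
  • Employ memory‑mapped files (CreateFileMapping, MapViewOfFile) for large sequential reads.
  • Batch small writes into larger buffers.

HANDLE hFile = CreateFileW(L"data.bin", GENERIC_READ, FILE_SHARE_READ, NULL,
    OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);  // INVALID_HANDLE_VALUE on failure
OVERLAPPED ov = {0};                             // zero offset: read from the start of the file
ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);  // signaled when the read completes
char buffer[64 * 1024];                          // must stay valid until the read completes
DWORD bytesRead = 0;
BOOL ok = ReadFile(hFile, buffer, sizeof(buffer), NULL, &ov);
// FALSE with ERROR_IO_PENDING means the read is still in flight, not that it failed.
if (ok || GetLastError() == ERROR_IO_PENDING)
    ok = GetOverlappedResult(hFile, &ov, &bytesRead, TRUE);  // TRUE: block until done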
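
The last item in the list above, batching small writes, can be illustrated with a small buffering layer. The sketch below is portable C++ and assumes a caller-supplied sink callback that stands in for the real write call (for example, a wrapper around WriteFile); BatchedWriter and Sink are hypothetical names, not Windows APIs.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical batching writer: collects small writes in memory and hands
// them to the sink only when the next write would overflow the buffer,
// when Flush() is called, or when the writer is destroyed.
class BatchedWriter {
public:
    using Sink = std::function<void(const char*, std::size_t)>;

    BatchedWriter(Sink sink, std::size_t capacity)
        : sink_(std::move(sink)), capacity_(capacity) {
        buffer_.reserve(capacity_);
    }
    ~BatchedWriter() { Flush(); }

    void Write(const char* data, std::size_t size) {
        if (buffer_.size() + size > capacity_) {
            Flush();                   // make room by emitting the current batch
        }
        if (size >= capacity_) {
            sink_(data, size);         // too large to batch: pass straight through
            return;
        }
        buffer_.append(data, size);
    }

    void Flush() {
        if (buffer_.empty()) {
            return;
        }
        sink_(buffer_.data(), buffer_.size());
        buffer_.clear();
    }

private:
    Sink sink_;
    std::size_t capacity_;
    std::string buffer_;
};
```

With a 16-byte capacity, ten 4-byte writes reach the sink as three calls (16, 16, and 8 bytes) instead of ten. Real buffers would more plausibly be sized in the tens of kilobytes, in line with the 64 KB buffer used above.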

Memory Management

Minimize fragmentation and allocation overhead:

  • Prefer HeapAlloc for variable‑size blocks and VirtualAlloc for large buffers.
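  • Use malloc/free (or new/delete in C++) in portable code that does not need the Windows heap APIs, and release each block with the routine that matches its allocator.
  • Align frequently accessed data structures to the cache line size (typically 64 bytes).

#define CACHE_LINE 64  // typical x86/x64 line size; confirm for the target CPU

// alignas (C++11) places every AlignedBuffer on a cache-line boundary.
struct alignas(CACHE_LINE) AlignedBuffer {
    char data[256];
};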
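
One common reason to align to a cache line is to keep data written by different threads on separate lines, which avoids false sharing. The following sketch assumes a 64-byte line and four worker slots; PaddedCounter, WorkerStats, and kCacheLine are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed line size; confirm for the target CPU

// Each counter occupies its own cache line, so a thread updating one slot
// does not invalidate the line that holds a neighbouring slot.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(alignof(PaddedCounter) == kCacheLine, "counter must start on a line boundary");
static_assert(sizeof(PaddedCounter) == kCacheLine, "counter must fill exactly one line");

// Hypothetical per-worker statistics block: one padded slot per worker.
struct WorkerStats {
    static constexpr std::size_t kWorkers = 4;
    PaddedCounter processed[kWorkers];

    // Sums all slots; relaxed ordering is enough for a statistics read.
    std::uint64_t Total() const {
        std::uint64_t sum = 0;
        for (const PaddedCounter& counter : processed) {
            sum += counter.value.load(std::memory_order_relaxed);
        }
        return sum;
    }
};
```

The padding costs 56 bytes per counter here, so it is worth applying only to data that several threads write frequently.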

Threading and Concurrency

Guidelines for efficient multithreading:

  • Use thread pools (CreateThreadpoolWork) instead of creating raw threads.
  • Avoid lock contention; prefer lock‑free structures or SRWLOCK.
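  • Set thread affinity only when necessary to improve cache locality.

PTP_WORK work = CreateThreadpoolWork(WorkCallback, NULL, NULL);  // NULL on failure
SubmitThreadpoolWork(work);
WaitForThreadpoolWorkCallbacks(work, FALSE);  // FALSE: wait for pending callbacks instead of cancelling them
CloseThreadpoolWork(work);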
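
Where the shared state is a single value, an atomic update can remove the lock entirely. The sketch below uses standard C++ atomics rather than any Windows‑specific primitive; HighWaterMark is a hypothetical class that records the largest value observed by any thread.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical high-water-mark tracker: records the largest sample seen
// without taking a lock.
class HighWaterMark {
public:
    void Observe(std::uint64_t sample) {
        std::uint64_t current = max_.load(std::memory_order_relaxed);
        // Try to publish `sample` only while it is still larger than the
        // stored value. When compare_exchange_weak fails, it reloads
        // `current` with the latest stored value and the loop re-checks.
        while (sample > current &&
               !max_.compare_exchange_weak(current, sample, std::memory_order_relaxed)) {
        }
    }

    std::uint64_t Get() const { return max_.load(std::memory_order_relaxed); }

private:
    std::atomic<std::uint64_t> max_{0};
};
```

Relaxed ordering is sufficient here only because the value is a standalone statistic. If other data must become visible together with the update, stronger ordering or a lock such as SRWLOCK is the safer choice.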

Power Efficiency

For battery‑powered devices:

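  • Call SetThreadExecutionState only while work must keep the system or display awake, and clear the request (ES_CONTINUOUS alone) as soon as that work finishes.
  • Throttle background work with waitable timers (CreateWaitableTimer); SetWaitableTimerEx accepts a tolerable delay that lets the system coalesce wake‑ups.
  • Use PowerCreateRequest with PowerSetRequest to keep the system or the process running for the duration of a short burst of work, and release the request with PowerClearRequest.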