Austin Prescott

Optimizing C++ Code

The greatest gains from optimization come from the code where the most execution time is spent. The previous post detailed a multi-dimensional array implementation in C++ with impressively low storage overhead. Its performance was lower than that of more conventional implementations, but there may be ways to optimize it. However, there is no optimization without benchmarking; one must know where execution time can be reduced the most. Hours could be spent optimizing a portion of code that contributes only 1% of the execution time, while another portion contributes 30%.


Benchmarking Code

The following video by The Cherno on YouTube covers a simple yet adequate method for measuring the execution time of code. This method will be used to benchmark the performance of the two multi-dimensional array implementations. It leverages constructors, destructors, and variable scopes to capture execution times easily and cleanly. The second video shows how to present the benchmark data visually.




A very important point is that benchmarking must be performed on code compiled in Release mode, because Release mode optimizes the code and represents the final application's performance.


Testing Code

The test compares the traditional code for a multi-dimensional array against the MultArray code with chained bracket operators. It creates two 3D arrays, each holding 1,000,000 4x4 matrices of float data, then creates a third array of the same size to store the products of multiplying the matrices in the first two. The code reports the time to initialize the arrays and measures the total size of one of them. Lastly, it reports the time for the multiplication. Note that the test could be written to work around the performance overhead of MultArray, but the goal of this test is to expose that overhead.


int main()
{
 	const size_t rows = 1000000U;

	//
 	// MultArray[][][]
 	//
	{	//Scope for MultArray
 		estd::MultArray<float, 3> Matrices1;
 		estd::MultArray<float, 3> Matrices2;
 		size_t size_Matrix1 = 0;

 		{	//Scope for MultArray initialization testing
 			estd::Timer timer("MultArray[i][j][k] - Initiation");

 			Matrices1.setDimLengths(rows, 4U, 4U);
 			Matrices2.setDimLengths(rows, 4U, 4U);

 			for (size_t i = 0; i < rows; i++)
 			{
 				for (size_t j = 0; j < 4; j++)
 				{
 					for (size_t k = 0; k < 4; k++)
 					{
 						Matrices1[i][j][k] = 3.14159f * i * j * k;
 						Matrices2[i][j][k] = 1.41421f * i * j * k;

 						size_Matrix1 += sizeof(Matrices1[i][j][k]);
 					}
 				}
 			}
 		}
 		estd::log("MultArray is " + std::to_string((sizeof(Matrices1) + size_Matrix1) / (1024 * 1024)) + std::string(" MiB in size."));

 		std::cout << std::endl;

 		{	//Scope for MultArray performance testing
			estd::MultArray<float, 3> Product;
 			Product.setDimLengths(rows, 4U, 4U);

 			estd::Timer timer("MultArray[i][j][k] - Performance");

 			for (size_t i = 0; i < rows; i++)
 			{
 				for (size_t j = 0; j < 4; j++)
 				{
 					for (size_t k = 0; k < 4; k++)
 					{
 						Product[i][j][k] = 0;
 						for (size_t n = 0; n < 4; n++)
 						{
 							Product[i][j][k] += Matrices1[i][n][k] * Matrices2[i][j][n];
 						}
 					}
 				}
 			}
 		}
 		std::cout << std::endl;
 		std::cout << "--------------------------------" << std::endl;
 	}

 //
 // Traditional
 //
 {	//Scope for Traditional
 	float*** MatricesA = new float** [rows];
 	float*** MatricesB = new float** [rows];
 	size_t size_MatrixA = 0;

 	{	//Scope for Traditional initialization testing
		 estd::Timer timer("Typename***[i][j][k] - Initiation");

 		for (size_t i = 0; i < rows; i++)
 		{
 			MatricesA[i] = new float* [4];
 			MatricesB[i] = new float* [4];

 			size_MatrixA += sizeof(MatricesA[i]);

 			for (size_t j = 0; j < 4; j++)
 			{
 				MatricesA[i][j] = new float[4];
 				MatricesB[i][j] = new float[4];

 				size_MatrixA += sizeof(MatricesA[i][j]);

 				for (size_t k = 0; k < 4; k++)
 				{
 					MatricesA[i][j][k] = 3.14159f * i * j * k;
 					MatricesB[i][j][k] = 1.41421f * i * j * k;

 					size_MatrixA += sizeof(MatricesA[i][j][k]);
 				}
 			}
 		}
 	}
 	estd::log("TraditionalArray is " + std::to_string((sizeof(MatricesA) + size_MatrixA) / (1024 * 1024)) + std::string(" MiB in size."));

 	std::cout << std::endl;

 	{	//Scope for Traditional performance testing
 		float*** Product = new float** [rows];
 		for (size_t i = 0; i < rows; i++)
 		{
 			Product[i] = new float* [4];

 			for (size_t j = 0; j < 4; j++)
 			{
 				Product[i][j] = new float[4];
 			}
 		}

 		estd::Timer timer("Typename***[i][j][k] - Performance");

 		for (size_t i = 0; i < rows; i++)
 		{
 			for (size_t j = 0; j < 4; j++)
 			{
 				for (size_t k = 0; k < 4; k++)
 				{
 					Product[i][j][k] = 0;
 					for (size_t n = 0; n < 4; n++)
 					{
 						Product[i][j][k] += MatricesA[i][n][k] * MatricesB[i][j][n];
 					}
 				}
 			}
 		}
 	}
 	std::cout << std::endl;
 }

 	std::cin.get();
}

Initial Results

The benchmarking times varied, but this is the console output of a single run.

MultArray[i][j][k] - Initiation: 365ms
MultArray is 61 MiB in size.
MultArray[i][j][k] - Performance: 1495ms
--------------------------------
Typename***[i][j][k] - Initiation: 1137ms
TraditionalArray is 80 MiB in size.
Typename***[i][j][k] - Performance: 137ms

After five executions, these are the execution times.

MultArray[i][j][k] - Initiation: 375±5ms

MultArray[i][j][k] - Performance: 1490±8ms

Typename***[i][j][k] - Initiation: 1128±6ms

Typename***[i][j][k] - Performance: 137±2ms


The initiation time for MultArray is significantly faster, at one third that of the traditional array. However, its time in the performance test is a staggering 11 times longer. That is unacceptable, and it results from the overhead of chaining the bracket operators.


Analysis

The chained bracket operator is based upon the following code snippets. To be clearer, the code has been modified so that the template parameters are consistent between all snippets, as if they were in the same scope.


MultArrayDimension<T, _Dims, _Dims - 1U>& MultArray<T, _Dims>::operator[](size_t index)
{
 setIndex(1U, index);
 return returnObj;
}

void MultArray<T, _Dims>::setIndex(size_t dim, size_t index)
{
 m_index = index * m_dimElementProducts[dim];
}

MultArrayDimension<T, _Dims, _Dim - 1U>& MultArrayDimension<T, _Dims, _Dim>::operator[](size_t index)
{
 m_MultArray->addIndex(_Dims - _Dim + 1U, index);
 return returnObj;
}

void MultArray<T, _Dims>::addIndex(size_t dim, size_t index)
{
 m_index += index * m_dimElementProducts[dim];
}

T& MultArrayDimension<T, _Dims, 1U>::operator[](size_t index)
{
 m_MultArray->addIndex(_Dims, index);
 return m_MultArray->getData(m_MultArray->getIndex());
}

Each bracket operator reads and modifies the value of m_index, which requires a fair amount of jumping around in memory. Therefore, at the very least, the bracket operators must be replaced.


Optimization

Here are the methods that were added to the MultArray class to implement a single operator that accepts all three indices of a 3D array. This code still requires multiple operations to expand the parameter pack, but it simplifies the instructions somewhat.


template<typename... Args>
T& MultArray::operator()(Args... args)
{
	size_t count = 0;
	m_index = buildIndex(count, args...);
	return m_data[m_index];
}

template<typename... Args>
size_t MultArray::buildIndex(size_t &count, size_t &index, Args&... args)
{
	count++;
	return index * m_dimElementProducts[count] + buildIndex(count, args...);
}
size_t MultArray::buildIndex(size_t &count) { return 0U; }

Benchmarking this code yields the following execution times.

MultArray(i, j, k) - Initiation: 41±0ms

MultArray(i, j, k) - Performance: 349±8ms

Typename***[i][j][k] - Initiation: 1167±20ms

Typename***[i][j][k] - Performance: 136±1ms


This is significantly faster than the chained bracket operators. It initializes very quickly, and it takes only 2.6 times as long as the traditional multi-dimensional array setup in the performance test. However, it still has a fair amount of overhead.


Perhaps the overhead can be decreased by specifying an operator overload for a fixed number of dimensions. The new operator overload is shown below.


T& MultArray::operator()(size_t n1, size_t n2, size_t n3)
{
 return m_data[n1 * m_dimElementProducts[1] + n2 * m_dimElementProducts[2] + n3];
}

Benchmarking this revised code yields even worse results than before. After some testing, it was discovered that the initialization overhead carried over from the old design was hurting read and write performance far more. Therefore, all the old functionality for bracket-operator chaining was removed. Moreover, it was found that using recursive method calls to process parameter packs was faster than hardcoding a specific number of parameters. The code became far simpler, as seen below.



template<typename T, size_t _Dims>
class MultArray
{
private:
 T* m_data = nullptr;
 size_t m_dimLengths[_Dims];
 size_t m_dimElementProducts[_Dims + 1U];

public:
 template<typename... Args>
 MultArray(Args... args)
 {
  size_t index = 0;
  setLengths(index, args...);
  m_dimElementProducts[_Dims] = 1U;
  for (int n = _Dims - 1U; n > -1; n--)
  {
   m_dimElementProducts[n] = m_dimElementProducts[n + 1] * m_dimLengths[n];
  }
  m_data = new T[m_dimElementProducts[0U]];
 }

 virtual ~MultArray()
 {
  delete[] m_data;
  m_data = nullptr;
 }

 const size_t getLength(size_t dim) const
 {
  return m_dimLengths[dim];
 }

 const size_t getDataLength() const
 {
  return m_dimElementProducts[0U];
 }

 T& operator[](size_t index)
 {
  return m_data[index];
 }

 template<typename... Args>
 T& operator()(Args... args)
 {
  size_t count = 0;
  return m_data[buildIndex(count, args...)];
 }

private:
 template<typename... Args>
 size_t buildIndex(size_t& count, size_t& index, Args&... args)
 {
  count++;
  return index * m_dimElementProducts[count] + buildIndex(count, args...);
 }
 size_t buildIndex(size_t& count) { return 0U; }

 template<typename... Args>
 void setLengths(size_t& index, size_t& length, Args&... args)
 {
  m_dimLengths[index] = length;
  index++;
  setLengths(index, args...);
 }
 void setLengths(size_t& index) {}
};

This code initializes in less than 3% of the time of the traditional multi-dimensional array, and it decreased the execution time of the performance test to about half that of the traditional array. The most surprising insight from the testing was that the recursive method calls were faster than hardcoding a fixed number of parameters. That the constructor impacted the performance test results more than the overloaded operator was very counterintuitive.


This counterintuitive behavior is the result of compiler optimization in Release mode, which is why benchmarking in Release mode is so important. When run in Debug mode, this final code takes over twice as long to initialize as the traditional multi-dimensional array implementation and roughly five times as long to complete the performance test. Although it was not tested, hardcoding the number of parameters would likely have been faster in Debug mode but slower in Release mode.


Debug mode did not just slow down the MultArray. In Debug mode, the traditional implementation took over three times as long to initialize and took nearly ten times as long to complete the performance test when compared to the same code in Release mode.


Take Away

The performance impacts of these coding decisions would not have been obvious without benchmarking. Simply hardcoding the original parentheses operator for three dimensions may have seemed like enough, but in reality it decreased performance. Even faster, more than acceptable, performance was achieved by retaining the generality while simplifying the structure of the class and optimizing its constructor.


Ultimately, the optimization was successful: the final MultArray class is a general-purpose tool that performs significantly faster than a traditional multi-dimensional array structure while being far more memory-efficient. None of this would have been possible without benchmarking. The class now has a solid foundation for finishing touches such as debugging checks, resizing functionality, and a copy constructor.
