Multicore Applications

The following guide was created using a 75G5

Summary

Multi-Core applications are applications that use multiple cores on a CPU

There are 2 main ways to create a multi-core application

Multi-threading

Multithreading creates multiple threads within a single process to boost computing speed.

Threads share the same memory space and resources within a process.
Threads execute concurrently, improving throughput.
Creating a thread is cheaper than creating a process.
Example: A web server handling multiple client requests concurrently using threads.
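As a minimal sketch of the shared memory point above (not part of the original samples), the following starts two threads that write into the same global array. The names `results`, `worker`, and `run_shared_memory_demo` are hypothetical and chosen only for illustration. Each thread writes to its own slot, so no lock is needed in this particular case.

```c
#include <pthread.h>

#define NUM_WORKERS 2

/* Visible to every thread in the process (hypothetical example data). */
static int results[NUM_WORKERS];

/* Each worker writes only to its own slot of the shared array. */
static void *worker(void *arg)
{
    int index = *(int *)arg;
    results[index] = (index + 1) * 100;
    return NULL;
}

/* Starts the workers, waits for them, and returns the sum of their
 * results, or -1 if a thread could not be created. */
int run_shared_memory_demo(void)
{
    pthread_t threads[NUM_WORKERS];
    int indexes[NUM_WORKERS];
    int created = 0;
    int i;
    int sum = 0;

    for (i = 0; i < NUM_WORKERS; i++) {
        indexes[i] = i;
        if (pthread_create(&threads[i], NULL, worker, &indexes[i]) != 0)
            break;
        created++;
    }
    /* Join only the threads that were actually started. */
    for (i = 0; i < created; i++)
        pthread_join(threads[i], NULL);
    if (created != NUM_WORKERS)
        return -1;

    /* The main thread reads what the workers wrote: same memory. */
    for (i = 0; i < NUM_WORKERS; i++)
        sum += results[i];
    return sum;
}
```

Calling `run_shared_memory_demo()` from `main` returns 300 when both threads run, because the main thread reads the values (100 and 200) that the workers stored.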

Multi-processing (Forking)

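Multiprocessing runs two or more processes at the same time; on a multi-core CPU the operating system can schedule them on different cores.

Multiple processes run simultaneously, each with its own memory space and resources.
It increases CPU utilization by distributing tasks across different cores.
Process creation is slower than thread creation, because a separate memory space and set of resources must be set up for each process.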
Example: Rendering frames in a video editing software using separate CPU cores for each frame.
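The separate memory space can be seen in a few lines. The sketch below is not part of the original samples and uses a hypothetical function name, `run_separate_memory_demo`. The child process changes its own copy of a variable, and the parent's copy is left untouched.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's copy of `value` after the child has changed
 * its own copy, or -1 if fork() failed. */
int run_separate_memory_demo(void)
{
    int value = 1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;      /* fork failed, no child was created */

    if (pid == 0) {
        value = 99;     /* changes only the child's copy */
        _exit(0);       /* leave the child without running parent code */
    }

    waitpid(pid, NULL, 0); /* parent waits for the child to finish */
    return value;          /* the parent's copy was never modified */
}
```

Calling `run_separate_memory_demo()` returns 1. If two processes need to exchange data, they have to use an explicit mechanism such as a pipe or a shared memory segment, which is the main practical difference from threads.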

Context Switching

Suspension and State Storage: When the operating system decides to switch from executing one process (or thread) to another, it suspends the current process. It saves the CPU’s state (context) for that process in memory. This context includes information like register values, program counter, and other relevant data.
Context Retrieval: The OS then retrieves the context of the next process (or thread) that needs to run. This context was previously stored in memory during a previous context switch.
Resuming Execution: Finally, the OS restores the retrieved context into the CPU’s registers. It sets the program counter to the location where the process was interrupted, allowing it to resume execution from that point.

Context switching makes multitasking possible by letting the CPU switch between processes or threads while preserving each one’s state. Each switch has a cost, because the CPU does no application work while a context is being saved and restored.
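Context switches can be observed from inside a program. The sketch below is an illustration only and assumes Linux, where `getrusage` fills in `ru_nvcsw`, the number of voluntary context switches (times the process gave up the CPU, for example by sleeping). The function name `count_switches_during_sleep` is hypothetical.

```c
#define _GNU_SOURCE
#include <sys/resource.h>
#include <time.h>

/* Sleeps for 10 ms and returns how many voluntary context switches the
 * process recorded during that time, or -1 if getrusage() failed. */
long count_switches_during_sleep(void)
{
    struct rusage before;
    struct rusage after;
    struct timespec delay;

    delay.tv_sec = 0;
    delay.tv_nsec = 10 * 1000 * 1000; /* 10 ms */

    if (getrusage(RUSAGE_SELF, &before) != 0)
        return -1;

    /* Sleeping gives up the CPU, so the OS saves this context and
     * runs something else until the sleep ends. */
    nanosleep(&delay, NULL);

    if (getrusage(RUSAGE_SELF, &after) != 0)
        return -1;

    return after.ru_nvcsw - before.ru_nvcsw;
}
```

On a typical Linux system the returned count is at least 1, but the exact number depends on the kernel and on what else is running, so treat it as an observation rather than a fixed value.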

Applications

Here are some examples of applications:

Multithreading:
- GUI Applications: Improve responsiveness by handling UI events in separate threads.
- File I/O: Read/write files concurrently without blocking the main thread.
- Network Servers: Handle multiple client connections simultaneously.
- Parallel Data Processing: Divide large datasets into smaller chunks for parallel processing.
- Process Failure Testing: Multi-Threading can be used to make sure that a process is operating correctly without hurting the performance of the process by testing it on a separate core.
Multiprocessing:
- Image Processing: Transformations to images using multiple processes.
- Scientific Computing: Solve complex mathematical problems by distributing computations across processors.
- Video Encoding: Encode videos faster by splitting frames across multiple processes.
- Simulation: Simulate real-world scenarios (e.g., physics simulations) using parallel processes.
- Data Parallelism: Train machine learning models on subsets of data in parallel.

Advantages

Multi-core applications can significantly improve the performance of an application by splitting threads / processes between cores
Multi-core applications allow for testing without affecting the performance of the application by running tests on a separate core

Project Guide

Overview

We will be creating 3 different applications

The first 2 applications do the same work but use different forms of multi-core programming. In this case it doesn’t matter which you use, as you will get the same result either way.
- A multi-threaded sorting algorithm that sorts 2 identical 10,000 integer lists
- A multi-processed sorting algorithm that sorts 2 identical 10,000 integer lists
  
  Here’s an example of what this looks like
The third one is a state machine that is set to fail at a specific state; the failure is detected by a tester thread running on a different core
- This will use multi-threading as it requires shared memory to detect failures
  
  Here’s an example of the third sample app:

What you will learn

How to create a multi-threaded application
How to create a multi-processed application
Reasons to pick either multi-processing or multi-threading
How to synchronize multi-threaded applications

Requirements

Required Hardware:

Any NAI Multi-Core Board

For this guide I used a 75G5 as it uses a dual-core processor (specifically the Xilinx Zynq 7015)

Xilinx Setup

Note	This is the setup for an empty project. If you are using an NAI SSK, please follow the proper project setup for your SSK version.

Press "Browse"
Pick a location for your workspace, then create a new folder with your workspace name, then press "Ok"
Expand your workspace

Setup your application

Right click on the empty space in the "C/C++ Projects" tab on the left side of the screen
Select New → Project..
Select "C Project", then click "Next"
Now pick a project name, then in "Project Type:" under "Makefile Project" select Empty Project

Note

Your toolchain depends on your board’s processor. See board specific documentation for the correct option. The shown option is for a 75G5 (Xilinx Zynq 7015)
Now right click on the project and click "properties"
In "C/C++ Build" turn on "Generate Makefiles Automatically"
Now right click on your project click New → Source File
You can name it whatever you want. For this example I am naming it "main.c"

Create a main method. I am using the following code

#include <stdint.h> // int32_t is declared here
#include <stdio.h>
#include <stdlib.h>

int32_t main(void)
{
    printf("Hello World\n");
    return 0;
}

When you are done, press CTRL+B to build the project
Now go back to project properties (Right click project, then "properties") and turn off "Generate Makefiles Automatically"

Now we have to make sure that the compiler uses the "pthreads" library for multi-threading

In the "C/C++ Projects" tab, open the default folder of your project, then open the makefile

The make file should contain the following lines

Multicore_Multiprocessed: $(OBJS)  $(USER_OBJS)
    @echo 'Building target: $@'
    @echo 'Invoking: ARM Linux gcc linker'
    arm-xilinx-linux-gnueabi-gcc  -o "Multicore_Multiprocessed" $(OBJS) $(USER_OBJS) $(LIBS)
    @echo 'Finished building target: $@'
    @echo ' '

In the following line

arm-xilinx-linux-gnueabi-gcc  -o "Multicore_Multiprocessed" $(OBJS) $(USER_OBJS) $(LIBS)

add -pthread after arm-xilinx-linux-gnueabi-gcc, so the section of code should look like

Multicore_Multiprocessed: $(OBJS)  $(USER_OBJS)
    @echo 'Building target: $@'
    @echo 'Invoking: ARM Linux gcc linker'
    arm-xilinx-linux-gnueabi-gcc -pthread  -o "Multicore_Multiprocessed" $(OBJS) $(USER_OBJS) $(LIBS)
    @echo 'Finished building target: $@'
    @echo ' '

Now you can start making your application

Sample Apps

All the sample apps will use the following base template:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <ctype.h>
#include <stdbool.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <sys/wait.h> //wait() and waitpid(), used by the multi-processed app

int32_t main(void)
{
    //code goes here
    return 0;
}

Sorting

For the 2 sorting applications we will use some of the same functions:

Let’s start with a bubble sort function. Bubble sort is slow, which helps visualize the processing time

//Bubble Sort Helper
void swap(int* xp, int* yp)
{
    int temp = *xp;
    *xp = *yp;
    *yp = temp;
}

void bubbleSort(int arr[], int n)
{
    int i, j;
    bool swapped;
    for (i = 0; i < n - 1; i++)
    {
        swapped = false;
        for (j = 0; j < n - i - 1; j++)
        {
            if (arr[j] > arr[j + 1])
            {
                swap(&arr[j], &arr[j + 1]);
                swapped = true;
            }
        }
        if (swapped == false)
            break;
    }
}

This is optional but a print array function can be useful for testing

// Function to print an array
void printArray(int arr[], int size)
{
    int i;
    for (i = 0; i < size; i++)
    {
        printf("%d ", arr[i]);
    }
}

Multi-processed Sorting

To start with the multi-processed sorting, add the following variables and macros that will be used in the application

#define TIME ((int)time(NULL) - startTime) //seconds elapsed since startTime

# define CPU_ZERO(cpusetp)     __CPU_ZERO_S (sizeof (cpu_set_t), cpusetp)
# define CPU_SET(cpu, cpusetp)     __CPU_SET_S (cpu, sizeof (cpu_set_t), cpusetp)

bool bQuit = false;

To have multiple processes we need to create a function that each process will run. In this function we will create our array and sort it, while keeping track of the time.

/*
 * Work of a single process.
 */
void ProcessWork(int core)
{
    cpu_set_t mask;       // A CPU set is a bitmask of CPU cores
    CPU_ZERO(&mask);      // Clear every core from the set
    CPU_SET(core, &mask); // Add the requested core to the set
    // A pid of 0 applies the affinity mask to the calling process
    int result = sched_setaffinity(0, sizeof(mask), &mask);
    if (result != 0)
    {
        printf("Failed to set affinity\n");
        return;
    }

    int startTime = (int)time(NULL);
    // Keeps track of the start time

    // Create an array of integers from 10000 down to 1
    int arr[10000];
    int i;
    for (i = 10000; i > 0; i--) {
        arr[10000 - i] = i;
    }
    int n = sizeof(arr) / sizeof(arr[0]);

    bubbleSort(arr, n); // Sort the array
    printf("Sorted array created: \n");

    // sched_getcpu() returns the index of the core this process is running on.
    // A cpu_set_t is a bitmask, so it cannot be printed with %d directly.
    int currentCore = sched_getcpu();

    fprintf(stdout, "Timestamp: %d seconds", TIME);
    printf(" on core %d\n", currentCore);
}

Now that we have created a function for what a process does, we have to create the processes. We will do this in the main function.

#include <sys/wait.h> //needed for waitpid()

int main(int argc, const char * argv[])
{
    int i = 0;
    while (!bQuit) //this loop runs 30 times
    {
        printf("\n set: %d\n", i); //this prints which set of processes we're at
        fflush(stdout); //flush buffered output so the children don't print it again
        pid_t pid[2];
        int core1 = 0;
        int core2 = 1;

        pid_t pid1 = fork(); //fork() returns 0 in the child and the child's pid in the parent
        if(pid1 == 0)
        {
            ProcessWork(core1);
            exit(0); //this exits the child process
        }
        pid[0] = pid1; //the parent adds the child to our array of processes
        pid_t pid2 = fork();
        if(pid2 == 0)
        {
            ProcessWork(core2);
            exit(0);
        }
        pid[1] = pid2;

        int j; //this waits for both processes to be done before continuing on
        for(j = 0; j < 2; j++)
        {
            if (pid[j] > 0) //skip a child that failed to start (fork() returned -1)
            {
                waitpid(pid[j], NULL, 0); //the parent blocks here until that child exits
            }
        }
        if (i == 29)
        {
            bQuit = true;
        }
        i++;
    }
    return 0;
}

Multi-threaded Sorting

Similar to the multi-processed version, we need to declare our variables and macros.

#define TIME ((int)time(NULL) - startTime) //seconds elapsed since startTime

# define CPU_ZERO(cpusetp)     __CPU_ZERO_S (sizeof (cpu_set_t), cpusetp)
# define CPU_SET(cpu, cpusetp)     __CPU_SET_S (cpu, sizeof (cpu_set_t), cpusetp)

bool bQuit = false;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; //the lock is optional, it keeps other threads out of the shared memory while one thread holds it

Now we need to create a function to select a core

int selectCore(int core_id)
{
    const int num_cores = sysconf(_SC_NPROCESSORS_ONLN); //gets num of cores
    if (core_id < 0 || core_id >= num_cores)
        return -1;

    cpu_set_t cpuset; //creates cpu set
    CPU_ZERO(&cpuset); //zeros the cpu set
    CPU_SET(core_id, &cpuset); //set the core of the cpu set

    pthread_t current_thread = pthread_self(); //gets the current thread
    int toReturn = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
    //sets the core for the thread to run at
    return toReturn;
}

Now we need to make the function that each thread will run. The commented out lines are optional.

void* runThread(void* data)
{
//    pthread_mutex_lock(&lock); OPTIONAL (slower): Enable the lock on shared memory
//    If you enable the lock, also unlock it before each early return below
    int startTime = (int)time(NULL); //start reading time
    int* corePtr = (int*)data; //integer of what core to use
    if (!corePtr) //check null pointer
    {
        printf("Core pointer is NULL\n");
        return (void*)1;
    }
    int core_id = *corePtr;
    // Sets the core of the thread
    if (selectCore(core_id) != 0)
    {
        printf("Failed to set thread affinity for core %d\n", core_id);
        return (void*)1;
    }

    //create the array
    int arr[10000];
    int i;
    for (i = 10000; i > 0; i--)
    {
        arr[10000 - i] = i;
    }
    int n = sizeof(arr) / sizeof(arr[0]);

    bubbleSort(arr, n);
    printf("Sorted array created: \n");
    //    printArray(arr, n);

    //Gets the index of the core this thread is running on.
    //A cpu_set_t is a bitmask, so it cannot be printed with %d directly.
    int currentCore = sched_getcpu();

    fprintf(stdout, "Timestamp: %d seconds", TIME);
    printf(" on core %d\n", currentCore);

//    pthread_mutex_unlock(&lock); Unlock the shared memory lock

    return (void*)0;
}
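The sorting threads each work on their own local array, which is why the lock is optional there. As a smaller illustration of what a mutex protects, the sketch below (not part of the original samples) has two threads increment one shared counter. The names `counter`, `counter_lock`, and `run_locked_counter_demo` are hypothetical.

```c
#include <pthread.h>

#define INCREMENTS_PER_THREAD 100000

/* Shared by both threads; every access is guarded by counter_lock. */
static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_to_counter(void *unused)
{
    int i;
    (void)unused;
    for (i = 0; i < INCREMENTS_PER_THREAD; i++) {
        pthread_mutex_lock(&counter_lock);   /* only one thread past here */
        counter++;                           /* read, add, write back */
        pthread_mutex_unlock(&counter_lock); /* let the other thread in */
    }
    return NULL;
}

/* Runs two incrementing threads and returns the final counter value,
 * or -1 if a thread could not be created. */
long run_locked_counter_demo(void)
{
    pthread_t a;
    pthread_t b;

    counter = 0;
    if (pthread_create(&a, NULL, add_to_counter, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, add_to_counter, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;
}
```

With the lock in place the result is 200000. Without it, the two threads can overwrite each other's updates and the total can come out lower, which is the kind of memory conflict the lock is there to prevent.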

Now we just need the main method to create the threads

int32_t main(void)
{
    int i = 0;
    while (!bQuit)
    {
        printf("\n set: %d\n", i);
        int core1 = 0;
        int core2 = 1;
        pthread_t th1, th2;
        //create the threads
        pthread_create(&th1, NULL, runThread, &core1);
        pthread_create(&th2, NULL, runThread, &core2);
        //wait for the threads to be complete before continuing
        pthread_join(th1, NULL);
        pthread_join(th2, NULL);
        //this is an example of context switching
        if (i == 29)
        {
            bQuit = true;
        }
        i++;
    }
    return 0;
}

And now you have a multi-threaded sorting algorithm

Multi-threaded Failure Detection

For multi-threaded failure detection we need the following variables and macros

#define TIME ((int)time(NULL) - startTime) //seconds elapsed since startTime

# define CPU_ZERO(cpusetp)     __CPU_ZERO_S (sizeof (cpu_set_t), cpusetp)
# define CPU_SET(cpu, cpusetp)     __CPU_SET_S (cpu, sizeof (cpu_set_t), cpusetp)


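//shared memory
//volatile stops the compiler from keeping these values in a register, so each
//loop iteration re-reads them; it does not replace a mutex or atomics
volatile bool bQuit = false;
volatile int currentState = 1; //keeping track of states
pthread_t th1, th2; //Declaring this as a global variable so we can cancel the thread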

Note	We will be using the sorting algorithm that we made earlier for this application

Now we need to add a function to select a core. We will be using the same function as in the multi-threaded sorting application

int selectCore(int core_id)
{
    const int num_cores = sysconf(_SC_NPROCESSORS_ONLN); //gets num of cores
    if (core_id < 0 || core_id >= num_cores)
        return -1;

    cpu_set_t cpuset; //creates cpu set
    CPU_ZERO(&cpuset); //zeros the cpu set
    CPU_SET(core_id, &cpuset); //set the core of the cpu set

    pthread_t current_thread = pthread_self(); //gets the current thread
    int toReturn = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
    //sets the core for the thread to run at
    return toReturn;
}

Next we need to create a function to run the sorting. This function will take the parameters for cpuset and state for logging purposes so we can make sure the application is working properly

void* RunSorting(cpu_set_t* setcpuset, int state)
{
    int startTime = (int)time(NULL);
    int arr[10000];
    int i;
    for (i = 10000; i > 0; i--)
    {
        arr[10000 - i] = i;
    }
    int n = sizeof(arr) / sizeof(arr[0]);
    bubbleSort(arr, n);
    printf("Sorted array created: \n");
    fprintf(stdout, "Timestamp: %d seconds", TIME);
    // A cpu_set_t is a bitmask that can hold several cores, so it cannot be
    // printed with %d directly; find the first core that is set in the mask
    int core = -1;
    int c;
    for (c = 0; c < CPU_SETSIZE; c++)
    {
        if (CPU_ISSET(c, setcpuset))
        {
            core = c;
            break;
        }
    }
    printf(" on core %d ran on state %d\n", core, state);
    return (void*)0;
}

Now we need to create our main program. This will set a core and switch between states.

void* mainProgram(void* data)
{
    //set the core of the thread
    int* corePtr = (int*)data;
    if (!corePtr)
    {
        printf("Core pointer is NULL\n");
        return (void*)1;
    }
    int core_id = *corePtr;
    // Set thread affinity to the requested core
    if (selectCore(core_id) != 0)
    {
        printf("Failed to set thread affinity for core %d\n", core_id);
        return (void*)1;
    }

    //keep track of the current core for logging
    pthread_t current_thread = pthread_self();
    cpu_set_t setcpuset;
    CPU_ZERO(&setcpuset);
    CPU_SET(core_id, &setcpuset);
    pthread_getaffinity_np(current_thread, sizeof(cpu_set_t), &setcpuset);
    int i = 0;

    //switch between the states
    while (bQuit == false)
    {
        //state 1
        currentState = 1;
        RunSorting(&setcpuset, 1);
        //state 2
        currentState = 2;
        if (i != 2) //skip state 2 on the third cycle (i == 2) to test error detection
        {
            RunSorting(&setcpuset, 2);
        }
        i++;
    }
    return (void*)0; //reached once bQuit is set
}

Next we have to create the state error testing code that will run on the other core

void* stateTest(void* data)
{
    //set core of the thread
    int* corePtr = (int*)data;
    if (!corePtr)
    {
        printf("Core pointer is NULL\n");
        return (void*)1;
    }
    int core_id = *corePtr;
    // Set thread affinity to the requested core
    if (selectCore(core_id) != 0)
    {
        printf("Failed to set thread affinity for core %d\n", core_id);
        return (void*)1;
    }

     //create local variables
    bool stateFail = false;
    int lastState = 1;
    int startTime = (int)time(NULL);

    //loop until program closes
    while(bQuit == false)
    {
        //detect state change
        if (lastState != currentState)
        {
            //log failure if it changed too quickly
            if (TIME < 2) //NOTE: THIS NUMBER IS FOR 75G5 PROCESSING TIME
            {
                stateFail = true;
            }
            startTime = (int)time(NULL);
            lastState = currentState;
        }

        //Log failure if processing time is too long
        if (TIME > 6) //NOTE: THIS NUMBER IS FOR 75G5 PROCESSING TIME
        {
            stateFail = true;
        }

        //If failure is detected cancel main thread and close program
        if (stateFail == true)
        {
            printf("FAILURE DETECTED\n");
            pthread_cancel(th1);
            bQuit = true;
        }
        else
        { //This is optional, I added it for visualisation, but it just tells us that it's working properly
            printf("Working Properly\n");
        }
        //check for errors every second
        sleep(1);
    }
    return (void*)0; //reached once bQuit is set
}

The main method is pretty simple: it creates the threads, assigns the core each one runs on, and waits for the threads to finish executing

int32_t main(void)
{
    int core1 = 0;
    int core2 = 1;
    pthread_create(&th1, NULL, mainProgram, &core1);
    pthread_create(&th2, NULL, stateTest, &core2);
    pthread_join(th1, NULL);
    pthread_join(th2, NULL);
    return 0;
}

And now you have a simple application that tests for a state machine’s failure based on execution time.
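The timing rule used by the tester can also be written as a small standalone check. The sketch below is a variation, not the original implementation: it measures time with `clock_gettime(CLOCK_MONOTONIC, ...)`, which has finer resolution than `time()` and is not affected by changes to the system clock. The names `state_check_t`, `check_state_duration`, and `monotonic_seconds` are hypothetical, and on older toolchains `clock_gettime` may need the `-lrt` linker option.

```c
#define _GNU_SOURCE
#include <time.h>

typedef enum {
    STATE_OK,       /* duration is inside the allowed window */
    STATE_TOO_FAST, /* state changed sooner than expected */
    STATE_TOO_SLOW  /* state has lasted longer than expected */
} state_check_t;

/* Classifies how long a state lasted against a minimum and maximum
 * duration, all in seconds. The limits are board specific. */
state_check_t check_state_duration(double elapsed, double min_s, double max_s)
{
    if (elapsed < min_s)
        return STATE_TOO_FAST;
    if (elapsed > max_s)
        return STATE_TOO_SLOW;
    return STATE_OK;
}

/* Returns a timestamp in seconds from a clock that only moves forward. */
double monotonic_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```

A tester thread would record `monotonic_seconds()` when it sees a state change and pass the difference to `check_state_duration`. The guide uses limits of 2 and 6 seconds for the 75G5; while a state is still running only the `STATE_TOO_SLOW` result is meaningful, and `STATE_TOO_FAST` applies at the moment the state changes.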

Experimenting with the sample apps

Try changing the parameters of the sample applications. For example, put multiple processes or threads on 1 core and see how processing speed is affected. You should see a reduction in processing speed when doing this. Here are some examples
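One way to run this experiment with a number instead of a visual comparison is to time the threads directly. The sketch below is an illustration under assumptions, not part of the original samples: `busy_work` and `time_two_threads` are hypothetical names, the workload is a simple counting loop rather than the bubble sort, and pinning is best effort, so a thread that cannot be pinned still runs.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <time.h>

#define WORK_ITERATIONS 20000000UL

/* Pins the calling thread to the core passed in arg, then burns CPU time. */
static void *busy_work(void *arg)
{
    int core = *(int *)arg;
    volatile unsigned long sink = 0; /* volatile keeps the loop from being removed */
    unsigned long i;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* Best effort: if pinning fails, the thread keeps its default affinity. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (i = 0; i < WORK_ITERATIONS; i++)
        sink += i;
    return NULL;
}

/* Runs two busy threads on the given cores and returns the elapsed
 * wall-clock time in seconds, or -1.0 if a thread could not be created. */
double time_two_threads(int core_a, int core_b)
{
    struct timespec start;
    struct timespec end;
    pthread_t a;
    pthread_t b;

    clock_gettime(CLOCK_MONOTONIC, &start);
    if (pthread_create(&a, NULL, busy_work, &core_a) != 0)
        return -1.0;
    if (pthread_create(&b, NULL, busy_work, &core_b) != 0) {
        pthread_join(a, NULL);
        return -1.0;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Comparing `time_two_threads(0, 0)` with `time_two_threads(0, 1)` on a dual-core board should show the single-core run taking noticeably longer, since both threads then share one core. The actual times depend on the board and on system load.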

Multi-threading

2 threads on 1 core

2 threads on 2 cores

Multi-processing

2 processes on 1 core

2 processes on 2 cores

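Note	If the core number printed is wrong (for example 1 for core 0 and 2 for core 1), check what is being passed to printf: a cpu_set_t is a bitmask of cores, not a core index, so it cannot be printed with %d. The "top" command in Linux shows which core each thread or process is actually running on.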
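To print a core number rather than a mask, one option is to scan the set with `CPU_ISSET`, and another is to ask the kernel which core the caller is on with `sched_getcpu()`. Both are glibc features that require `_GNU_SOURCE`. The helper names below (`first_core_in_set`, `mask_roundtrip`, `current_core`) are hypothetical and only sketch the idea.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Returns the lowest core index present in the set, or -1 if it is empty. */
int first_core_in_set(const cpu_set_t *set)
{
    int core;
    for (core = 0; core < CPU_SETSIZE; core++) {
        if (CPU_ISSET(core, set))
            return core;
    }
    return -1;
}

/* Builds a set holding one core and reads the index back out of it. */
int mask_roundtrip(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return first_core_in_set(&set);
}

/* Returns the index of the core the caller is running on right now,
 * or -1 if the kernel cannot report it. */
int current_core(void)
{
    return sched_getcpu();
}
```

For a set filled in by `sched_getaffinity` or `pthread_getaffinity_np`, `first_core_in_set` gives a value that is safe to print with `%d`. Note that an affinity set lists the cores a thread is allowed to use, while `current_core()` reports the one it is using at that moment; for a thread pinned to a single core the two agree.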

Conclusion

There are 2 ways to handle multi-core applications
- Multi-threading
  - Multi-threading has shared memory
- Multi-processing
  - Multi-processing has separated memory
  - Multi-processing is good for avoiding memory conflicts between processes
Multi-core applications can dramatically improve processing speed when the work can be split across cores
