## The theory and practice of concurrency

Let us see what a speedup curve can tell us about our parallel Fibonacci program. We first need to get some data. The following command performs a sequence of runs of the Fibonacci program for varying numbers of processors. You can now run the command yourself. If successful, the command generates a file named plots. The output should look something like the speedup plot below. The plot shows that our Fibonacci application scales well, up to about twenty processors. As expected, at twenty processors, the curve dips downward somewhat.

We know that the problem size is the primary factor leading to this dip. How much does the problem size matter? The speedup plot in the Figure below clearly shows the trend. The run time that a given parallel program takes to solve the same problem can vary noticeably because of effects that are not under our control, such as OS scheduling, cache effects, and paging. We can treat such variation in our experiments as random noise. Noise can be a problem for us because it can lead us to draw incorrect conclusions when, say, comparing the performance of two algorithms that perform roughly the same.

To deal with randomness, we can perform multiple runs for each data point that we want to measure and consider the mean over these runs. The prun tool enables taking multiple runs via the -runs argument. Moreover, the pplot tool by default shows mean values for any given set of runs and optionally shows error bars. The documentation for these tools gives more detail on how to use the statistics-related features.
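The averaging step can be illustrated independently of the tools. The following is a minimal sketch, not part of prun or pplot; the names `run_stats` and `summarize` are hypothetical. It computes the mean and the sample standard deviation of a set of measured run times, which is the kind of information an error bar conveys.

```cpp
#include <cmath>
#include <vector>

// Summary statistics for repeated timing measurements (hypothetical helper type).
struct run_stats {
  double mean;    // arithmetic mean of the run times
  double stddev;  // sample standard deviation (0 if fewer than two runs)
};

// Computes the mean and sample standard deviation of `times`.
// Assumes `times` is non-empty.
run_stats summarize(const std::vector<double>& times) {
  double sum = 0.0;
  for (double t : times) sum += t;
  double mean = sum / times.size();

  double squared_dev = 0.0;
  for (double t : times) squared_dev += (t - mean) * (t - mean);
  double stddev =
      times.size() > 1 ? std::sqrt(squared_dev / (times.size() - 1)) : 0.0;

  return {mean, stddev};
}
```

If the standard deviation is large relative to the difference between two algorithms' mean run times, more runs are needed before drawing a conclusion.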

Suppose that, on our 40-processor machine, the speedup that we observe is larger than 40x. It might sound improbable or even impossible, but it can happen. Ordinary circumstances should preclude such a superlinear speedup because, after all, we have only forty processors helping to speed up the computation. Superlinear speedups often indicate that the sequential baseline program is suboptimal.

This situation is easy to check: just compare its run time with that of the sequential elision. If the sequential elision is faster, then the baseline is suboptimal. Other factors can cause superlinear speedup: sometimes parallel programs running on multiple processors with private caches benefit from the larger cache capacity. These issues are, however, outside the scope of this course. As a rule of thumb, superlinear speedups should be regarded with suspicion and the cause should be investigated.
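The checks described above amount to a few comparisons of measured run times. The sketch below uses hypothetical helper names and assumes all times are measured in the same unit on the same input.

```cpp
// Speedup of a parallel run relative to the sequential baseline.
double speedup(double baseline_time, double parallel_time) {
  return baseline_time / parallel_time;
}

// True if the observed speedup exceeds the number of processors used.
bool is_superlinear(double baseline_time, double parallel_time, int procs) {
  return speedup(baseline_time, parallel_time) > static_cast<double>(procs);
}

// True if the sequential elision runs faster than the baseline, which
// suggests that the baseline is not the best sequential program available.
bool baseline_is_suboptimal(double baseline_time, double elision_time) {
  return elision_time < baseline_time;
}
```

For instance, a baseline of 84 seconds and a 40-processor run of 2 seconds give a speedup of 42x, which `is_superlinear` flags for investigation.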

To put this hunch to the test, let us examine the utilization of the processors in our system. We first need to build a binary that collects and outputs logging data. Each of these fields can be useful for tracking down inefficiencies. In the present case, however, none of the new values shown above is highly suspicious, considering that they are all at most in the thousands. Since we have not yet found the problem, let us look at the visualization of the processor utilization using our pview tool.

To get the necessary logging data, we need to run our program again, this time passing the argument --pview. Every time we run with --pview, this binary file is overwritten. To see the visualization of the log data, we call the visualizer tool from the same directory. The output we see on our 40-processor machine is shown in the Figure below.

The window shows one bar per processor. Time goes from left to right. Idle time is shown in red and time spent busy with work in grey. You can zoom in on any part of the plot by clicking on the region with the mouse.


To reset to the original plot, press the space bar. From the visualization, we can see that most of the time, particularly in the middle, all of the processors keep busy. However, there is a lot of idle time in the beginning and end of the run. This pattern suggests that there just is not enough parallelism in the early and late stages of our Fibonacci computation.

We are pretty sure that our Fibonacci program is not scaling as well as it could. What is important is to know more precisely what it is that we want our Fibonacci program to achieve. To this end, let us consider a distinction that is important in high-performance computing: the distinction between strong and weak scaling.

In general, strong scaling concerns how the run time varies with the number of processors for a fixed problem size. Sometimes strong scaling is either too ambitious, owing to hardware limitations, or not necessary, because the programmer is happy to live with a looser notion of scaling, namely weak scaling. In weak scaling, the programmer considers a fixed-size problem per processor. We are going to consider something similar to weak scaling. In the Figure below, we have a plot showing how processor utilization varies with the input size. The scenario that we just observed is typical of multicore systems.

For computations that perform lots of highly parallel work, such limitations are barely noticeable, because processors spend most of their time performing useful work. The utilization plot is shown in the Figure below. We have seen in this lab how to build, run, and evaluate our parallel programs. Concepts that we have seen, such as speedup curves, are going to be useful for evaluating the scalability of our future solutions. Strong scaling is the gold standard for a parallel implementation. But as we have seen, weak scaling is a more realistic target in most cases.

In many cases, a parallel algorithm which solves a given problem performs more work than the fastest sequential algorithm that solves the same problem. This extra work deserves careful consideration for several reasons. First, since it performs additional work with respect to the serial algorithm, a parallel algorithm will generally require more resources such as time and energy.

By using more processors, it may be possible to reduce the time penalty, but only by using more hardware resources. Assuming perfect scaling, we can reduce the time penalty by using more processors. Sometimes, a parallel algorithm has the same asymptotic complexity as the best serial algorithm for the problem but has larger constant factors.

This is generally true because scheduling friction, especially the cost of creating threads, can be significant. In addition to friction, parallel algorithms can incur more communication overhead than serial algorithms because data and processors may be placed far apart in hardware. These considerations motivate considering the "work efficiency" of a parallel algorithm. Work efficiency is a measure of the extra work performed by the parallel algorithm with respect to the serial algorithm. We define two types of work efficiency: asymptotic work efficiency and observed work efficiency.

The former relates to the asymptotic performance of a parallel algorithm relative to the fastest sequential algorithm. The latter relates to the running time of a parallel algorithm relative to that of the fastest sequential algorithm. An algorithm is asymptotically work efficient if the work of the algorithm is the same as the work of the best known serial algorithm. The parallel array increment algorithm that we considered in an earlier chapter is asymptotically work efficient, because it performs linear work, which is optimal: any sequential algorithm must perform at least linear work.

We consider such algorithms unacceptable, as they are too slow and wasteful. We consider such algorithms to be acceptable. We build this code by using the special optfp "force parallel" file extension. This special file extension forces parallelism to be exposed all the way down to the base cases.

Later, we will see how to use this special binary mode for other purposes. In practice, observed work efficiency is a major concern. First, the whole effort of parallel computing is wasted if parallel algorithms consistently require more work than the best sequential algorithms. In other words, in parallel computing, both asymptotic complexity and constant factors matter.

Based on these discussions, we define a good parallel algorithm as follows. We say that a parallel algorithm is good if it has the following three characteristics. For example, a parallel algorithm that performs linear work and has logarithmic span leads to average parallelism on the order of thousands even with the small input size of one million. For such a small problem size, we usually would not need to employ thousands of processors. It would be sufficient to limit the parallelism so as to feed tens of processors and, as a result, reduce the impact of excess parallelism on work efficiency.
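As a rough check of the example above, ignore constant factors and take work $W(n) = n$ and span $S(n) = \log_2 n$ (forms assumed here only for illustration). The average parallelism for $n = 10^6$ is then

$$
\frac{W(n)}{S(n)} = \frac{n}{\log_2 n} = \frac{10^6}{\log_2 10^6} \approx \frac{10^6}{19.93} \approx 5 \times 10^4 .
$$

Even allowing for sizable constant factors in the span, this is far more parallelism than a machine with tens of processors can use.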

In many parallel algorithms such as the algorithms based on divide-and-conquer, there is a simple way to achieve this goal: switch from parallel to sequential algorithm when the problem size falls below a certain threshold. This technique is sometimes called coarsening or granularity control.
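Coarsening with a fixed threshold can be sketched as follows. This is a minimal illustration in standard C++, using `std::async` in place of PASL's fork-join primitives; the function names and the `threshold` parameter are assumptions made for the example.

```cpp
#include <future>

// Plain sequential Fibonacci, used once the problem is small enough.
long fib_seq(long n) {
  return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

// Parallel Fibonacci with manual coarsening: subproblems of size at most
// `threshold` are solved sequentially, so no threads are created for them.
long fib_par(long n, long threshold) {
  if (n < 2 || n <= threshold) return fib_seq(n);
  // Run one branch in a new thread and the other in the current thread.
  auto left = std::async(std::launch::async, fib_par, n - 1, threshold);
  long right = fib_par(n - 2, threshold);
  return left.get() + right;
}
```

Here `std::launch::async` creates a fresh thread for every parallel call, which is far more expensive than spawning a task in a work-stealing scheduler, so the threshold has to be chosen high enough that only a handful of threads are created.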

But which code should we switch to? One idea is to simply switch to the sequential elision, which we always have available in PASL. If, however, the parallel algorithm is asymptotically work inefficient, this would be ineffective. In such cases, we can specify a separate sequential algorithm for small instances. Optimizing the practical efficiency of a parallel algorithm by controlling its parallelism is sometimes called optimization, sometimes performance engineering, and sometimes performance tuning or simply tuning. In the rest of this document, we use the term "tuning."

In fact, there is barely a difference between the serial and the parallel runs. The tuning is actually done automatically here by using an automatic-granularity-control technique described in the section. Our parallel program on a single processor is one percent slower than the sequential baseline.

Such work efficiency is excellent. The basic idea behind coarsening or granularity control is to revert to a fast serial algorithm when the input size falls below a certain threshold. To determine the optimal threshold, we can simply perform a search by running the code with different threshold settings. While this approach can help find the right threshold on the particular machine on which we performed the search, there is no guarantee that the same threshold would work on another machine.

In fact, there are examples in the literature that show that such optimizations are not portable. In the general case, determining the right threshold is even more difficult. To see the difficulty, consider a generic (polymorphic), higher-order function such as map that takes a sequence and a function and applies the function to the sequence. The problem is that the threshold depends both on the type of the elements of the sequence and on the function to be mapped over the sequence. For example, if each element is itself a sequence (the sequence is nested), the threshold can be relatively small.

If, however, the elements are integers, then the threshold will likely need to be relatively large. This makes it difficult to determine the threshold because it depends on arguments that are unknown at compile time. Essentially the same argument applies to the function being mapped over the sequence: if the function is expensive, then the threshold can be relatively small, but otherwise it will need to be relatively large.

As we describe in this chapter, it is sometimes possible to determine the threshold completely automatically. There has been significant research into determining the right threshold for a particular algorithm. This problem, known as the granularity-control problem, turns out to be a rather difficult one, especially if we wish to find a technique that can ensure close-to-optimal performance across different architectures.

In this section, we present a technique for automatically controlling granularity by using asymptotic cost functions. In general, the value returned by the complexity function need only be precise with respect to the asymptotic complexity class of the associated computation. The complexity function above is preferable, however, because it is simpler.

In other words, when expressing work costs, we only need to be precise up to the asymptotic complexity class of the work. In PASL, a controlled statement, or cstmt, is an annotation in the program text that activates automatic granularity control for a specified region of code. To support such automatic granularity control, PASL uses a prediction algorithm to map the asymptotic work cost, as returned by the complexity function, to actual processor cycles.

When the predicted processor cycle count of a particular instance of the controlled statement falls below a threshold (determined automatically for the specific machine), that instance is sequentialized by turning off the ability to spawn parallel threads for the execution of that instance. If the predicted processor cycle count is higher than the threshold, then the statement instance is executed in parallel. In other words, the reader can think of a controlled statement as a statement that executes in parallel when the benefits of parallel execution far outweigh its cost, and that executes sequentially, much as the sequential elision of its body would, when the cost of parallelism exceeds its benefits.

We note that while the sequential execution is similar to a sequential elision, it is not exactly the same, because every call to fork2 must check whether it should create parallel threads or run sequentially. Thus the execution may differ from the sequential elision in terms of performance but not in terms of behavior or semantics.

The code below uses a controlled statement to automatically select, at run time, the threshold size for our parallel array-increment function. The controlled statement takes three arguments, whose requirements are specified below, and returns nothing. The effectiveness of the granularity controller may be compromised if any of the requirements are not met.
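The mechanism can be approximated in standard C++ as follows. This is a simplified sketch under stated assumptions and is not PASL's implementation: the `controller` type, `controlled_stmt`, and this `fork2` are stand-ins written for the example, and the controller uses fixed constants where PASL collects profiling data at run time.

```cpp
#include <functional>
#include <thread>

// Hypothetical controller: fixed estimates instead of run-time profiling.
struct controller {
  double cycles_per_unit;   // assumed cost of one unit of asymptotic work
  double threshold_cycles;  // below this predicted cost, run sequentially
};

// Declared globally, one controller per controlled statement.
controller map_incr_controller{1.0, 1000.0};

// Per-thread flag: true while running inside a sequentialized instance.
thread_local bool sequential_mode = false;

// Stand-in for fork2: checks the flag, then runs the two branches either
// one after the other or in two threads.
void fork2(const std::function<void()>& left,
           const std::function<void()>& right) {
  if (sequential_mode) { left(); right(); return; }
  std::thread t(left);
  right();
  t.join();
}

// Stand-in for a controlled statement with three arguments: the
// controller, the complexity function, and the body.
void controlled_stmt(controller& c,
                     const std::function<long()>& complexity,
                     const std::function<void()>& body) {
  bool saved = sequential_mode;
  if (!saved && c.cycles_per_unit * complexity() < c.threshold_cycles)
    sequential_mode = true;  // sequentialize this instance and its subtree
  body();
  sequential_mode = saved;
}

// Array increment: dest[i] = source[i] + 1 for every i in [lo, hi).
void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  controlled_stmt(map_incr_controller, [&] { return n; }, [&] {
    if (n == 0) return;
    if (n == 1) { dest[lo] = source[lo] + 1; return; }
    long mid = lo + n / 2;
    fork2([&] { map_incr_rec(source, dest, lo, mid); },
          [&] { map_incr_rec(source, dest, mid, hi); });
  });
}
```

With the constants chosen here, ranges of fewer than 1000 elements run without creating threads, while the recursion itself still goes all the way down to single elements.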


The first argument is a reference to the controller object. The controller object is used by the controlled statement to collect profiling data from the program as the program runs. The label must be unique to the particular controller. Moreover, the controller must be declared as a global variable. The second argument is the complexity function.

The type of the return value should be long. The third argument is the body of the controlled statement. The return type of the controlled statement should be void. When the controlled statement chooses sequential evaluation for its body, the effect is similar to the effect where, in the code above, the input size falls below the threshold size: the body and the recursion tree rooted there are sequentialized.

When the controlled statement chooses parallel evaluation, the calls to fork2 create parallel threads. It is not unusual for a divide-and-conquer algorithm to switch to a different algorithm at the leaves of its recursion tree. For example, sorting algorithms, such as quicksort, may switch to insertion sort at small problem sizes.

In the same way, it is not unusual for parallel algorithms to switch to different sequential algorithms for handling small problem sizes. Such switching can be beneficial especially when the parallel algorithm is not asymptotically work efficient. To provide such algorithmic switching, PASL provides an alternative form of controlled statement that accepts a fourth argument: the alternative sequential body. This alternative form of controlled statement behaves essentially the same way as the original described above, with the exception that when the PASL run time decides to sequentialize a particular instance of the controlled statement, it falls through to the provided alternative sequential body instead of the "sequential elision."

Even though the parallel and sequential array-increment algorithms are algorithmically identical, except for the calls to fork2, there is still an advantage to using the alternative sequential body: the sequential code does not pay for the parallelism overheads due to fork2. Even when eliding fork2, the run-time system has to perform a conditional branch to check whether the context of the fork2 call is parallel or sequential. Because the cost of these conditional branches adds up, the version with the sequential body is going to be more work efficient. Another reason why a sequential body may be more efficient is that it can be written more simply, for example using a for-loop instead of recursion, which will be faster in practice.

In general, we recommend that the code of the parallel body be written so as to be completely self contained, at least in the sense that the parallel body code contains the logic that is necessary to handle recursion all the way down to the base cases. Put differently, it should be the case that, if the parallelism-specific annotations including the alternative sequential body are erased, the resulting program is a correct program. We recommend this style because such parallel codes can be debugged, verified, and tuned, in isolation, without relying on alternative sequential codes.

Let us add one more component to our granularity-control toolkit: the parallel-for. By using this loop construct, we can avoid having to explicitly express recursion trees over and over again. For example, the following function performs the same computation as the example function we defined in the first lecture.

Only, this function is much more compact and readable. Moreover, this code takes advantage of our automatic granularity control, also by replacing the parallel-for with a serial-for. Notice that the code above specifies no complexity function. The reason is that this particular instance of the parallel-for loop implicitly defines a complexity function. The implicit complexity function reports a linear-time cost for any given range of the iteration space of the loop.

In other words, the implicit complexity function assumes that per iteration the body of the loop performs a constant amount of work. Of course, this assumption does not hold in general. If we want to specify explicitly the complexity function, we can use the form shown in the example below. The complexity function is passed to the parallel-for loop as the fourth argument.

The complexity function takes as argument the range [lo, hi). In this case, the complexity is linear in the number of iterations. The function simply returns the number of iterations as the complexity. The following code snippet shows a more interesting case for the complexity function. In this case, we are performing a multiplication of a dense matrix by a dense vector. The outer loop iterates over the rows of the matrix. The complexity function in this case gives to each of these row-wise iterations a cost in proportion to the number of scalars in each column. Matrix multiplication has been widely used as an example for parallel computing since the early days of the field.
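To make the role of the complexity function concrete, here is a sketch of dense-matrix by dense-vector multiplication in standard C++. The `parallel_for` below is a hypothetical stand-in written for this example: its argument order and its explicit `grain` parameter differ from PASL's loop, which chooses the cutoff automatically. The matrix is assumed to be square and stored in row-major order.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Hypothetical parallel-for over [lo, hi): splits the range recursively and
// runs a subrange sequentially once its reported cost is at most `grain`.
void parallel_for(long lo, long hi,
                  const std::function<long(long, long)>& complexity,
                  const std::function<void(long)>& body,
                  long grain) {
  if (hi - lo <= 1 || complexity(lo, hi) <= grain) {
    for (long i = lo; i < hi; i++) body(i);
    return;
  }
  long mid = lo + (hi - lo) / 2;
  std::thread left([&] { parallel_for(lo, mid, complexity, body, grain); });
  parallel_for(mid, hi, complexity, body, grain);
  left.join();
}

// Multiplies the n-by-n row-major matrix `mtx` by the vector `vec`.
std::vector<double> dmdvmult(const std::vector<double>& mtx,
                             const std::vector<double>& vec) {
  long n = static_cast<long>(vec.size());
  std::vector<double> result(n, 0.0);
  // Each row costs n scalar multiplications, so a range of rows [lo, hi)
  // costs (hi - lo) * n.
  auto row_cost = [n](long lo, long hi) { return (hi - lo) * n; };
  parallel_for(0, n, row_cost, [&](long i) {
    double dot = 0.0;
    for (long j = 0; j < n; j++) dot += mtx[i * n + j] * vec[j];
    result[i] = dot;
  }, 4096);
  return result;
}
```

A complexity function that returned only `hi - lo` would underestimate the cost of each row by a factor of n and could sequentialize ranges that are well worth running in parallel.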

There are good reasons for this. First, matrix multiplication is a key operation that can be used to solve many interesting problems. Second, it is an expensive computation that is nearly cubic in the size of the input; it can thus become very expensive even with modest inputs. Fortunately, matrix multiplication can be parallelized relatively easily as shown above.

The figure below shows the speedup for a sample run of this code. Observe that the speedup is rather good, achieving nearly excellent utilization. While parallel matrix multiplication delivers excellent speedups, this is not common for many other algorithms on modern multicore machines where many computations can quickly become limited by the availability of bandwidth.

Arrays are a fundamental data structure in sequential and parallel computing. When computing sequentially, arrays can sometimes be replaced by linked lists, especially because linked lists are more flexible. Unfortunately, linked lists are deadly for parallelism, because they require serial traversals to find elements; this makes arrays all the more important in parallel computing.

Each one has various pitfalls for parallel use. Therefore, the work and span cost of the call new[n] is n. But we can initialize an array in logarithmic span in the number of items. The STL vector implements a dynamically resizable array that provides push, pop, and indexing operations. The push and pop operations take amortized constant time and the indexing operation takes constant time. The STL vector also provides the method resize(n), which changes the size of the array to be n. The resize operation takes, in the worst case, linear work and span in proportion to the new size, n.

In other words, the resize function uses a sequential algorithm to fill the cells in the vector. Such sequential computations that exist behind the wall of abstraction of a language or library can harm parallelism by introducing implicit sequential dependencies. Finding the source of such sequential bottlenecks can be time consuming, because they are hidden behind the abstraction boundary of the native array abstraction that is provided by the programming language. We can avoid such pitfalls by carefully designing our own array data structure.

Because array implementations are quite subtle, we consider our own implementation of parallel arrays, which makes explicit the cost of array operations, allowing us to control them quite carefully. Specifically, we carefully control initialization and disallow implicit copy operations on arrays, because copy operations can harm observable work efficiency (their asymptotic work cost is linear). The key components of our array data structure, sparray, are shown by the code snippet below.

An sparray can store 64-bit words only; in particular, sparrays are monomorphic and fixed to values of type long.


We stick to monomorphic arrays here to simplify the presentation. The class sparray provides two constructors. The first one takes in the size of the array (set to 0 by default) and allocates an uninitialized array of the specified size (nullptr if the size is 0). The second constructor takes in a list specified by curly braces and allocates an array with the same size.

Since the argument to this constructor must be specified explicitly in the program, its size is constant by definition. The second constructor performs initialization, based on constant-size lists, and thus also has constant work and span. Array indexing: Each array-indexing operation, that is the operation which accesses an individual cell, requires constant work and constant span.

Size operation: The work and the span of accessing the size of the array is constant. The destructor takes constant time because the contents of the array are just bits that do not need to be destructed individually. Move assignment operator: Not shown, the class includes a move-assignment operator that gets fired when an array is assigned to a variable. This operator moves the contents of the right-hand side of the assigned array into that of the left-hand side.

This operation takes constant time. Copy constructor: The copy constructor of sparray is disabled. This prohibits copying an array unintentionally, for example, by passing the array by value to a function. The program below shows a basic use of sparrays. The first line allocates and initializes the contents of the array to be three numbers.

The second uses the familiar indexing operator to access the item at the second position in the array. The third line extracts the size of the array. The fourth line assigns to the second cell the value 5. The fifth prints the contents of the cell. Just after creation, the array contents consist of uninitialized bits.
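The interface described above, together with the five-line program, can be sketched as follows. This is a reconstruction from the prose and not the original class: the field names and the use of `malloc` are assumptions, and the move constructor is an addition that the text does not mention, included so that arrays can be returned from functions.

```cpp
#include <cstdlib>
#include <initializer_list>
#include <iostream>

class sparray {
  long* ptr = nullptr;  // first cell, or nullptr for an empty array
  long sz = 0;          // number of cells

public:
  // Allocates `n` uninitialized cells (no allocation if n is 0).
  explicit sparray(long n = 0) : sz(n) {
    if (n > 0) ptr = static_cast<long*>(std::malloc(sizeof(long) * n));
  }
  // Allocates and initializes from a brace-enclosed list.
  sparray(std::initializer_list<long> xs)
      : sparray(static_cast<long>(xs.size())) {
    long i = 0;
    for (long x : xs) ptr[i++] = x;
  }
  // Copying is disabled to rule out accidental linear-work copies.
  sparray(const sparray&) = delete;
  sparray& operator=(const sparray&) = delete;
  // Moves transfer ownership of the cells in constant time.
  sparray(sparray&& other) noexcept : ptr(other.ptr), sz(other.sz) {
    other.ptr = nullptr;
    other.sz = 0;
  }
  sparray& operator=(sparray&& other) noexcept {
    if (this != &other) {
      std::free(ptr);
      ptr = other.ptr;
      sz = other.sz;
      other.ptr = nullptr;
      other.sz = 0;
    }
    return *this;
  }
  // The cells are plain bits, so freeing the block is all that is needed.
  ~sparray() { std::free(ptr); }

  long& operator[](long i) { return ptr[i]; }
  const long& operator[](long i) const { return ptr[i]; }
  long size() const { return sz; }
};

// The five-line program described in the text.
void basic_use() {
  sparray xs = {1, 2, 3};        // allocate and initialize three cells
  long second = xs[1];           // read the item at the second position
  long n = xs.size();            // extract the size
  xs[1] = 5;                     // assign 5 to the second cell
  std::cout << xs[1] << std::endl;
  (void)second; (void)n;         // silence unused-variable warnings
}
```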

We use this convention because the programmer needs flexibility to decide the parallelization strategy to initialize the contents. Internally, the sparray class consists of a size field and a pointer to the first item in the array. The contents of the array are heap allocated automatically by the constructor of the sparray class and deallocated automatically when the array goes out of scope. We give several examples of this automatic deallocation scheme below. In the function below, the sparray object that is allocated on the frame of foo is deallocated just before foo returns, because the variable xs containing it goes out of scope.

Care must be taken when managing arrays, because nothing prevents the programmer from returning a dangling pointer. It is safe to take a pointer to a cell in the array when the array itself is still in scope. For example, in the code below, the contents of the array are used strictly when the array is in scope.

Preface: The goal of this book is to cover the fundamental concepts of parallel computing, including models of computation, parallel algorithms, and techniques for implementing and evaluating parallel algorithms.

Administrative Matters: The course combines theory and practice.

We will try to ask the following two questions.

Does it work in practice? Does it work in theory? Preliminaries 3. Introduction This class is motivated by recent advances in architecture that put sequential hardware on the path to extinction. Multithreaded computation Work and span Offline scheduling Structured or implicit parallel computation Fork-join, async-finish, nested parallelism. Parallelism versus concurrency. Chapter: Multithreading, Parallelism, and Concurrency The term multithreading refers to computing with multiple threads of control.

DAG Representation: A multithreaded computation can be represented by a Directed Acyclic Graph, written more simply as a dag, of vertices. Throughout this book, we make two assumptions about the structure of the dag. Each vertex has outdegree at most two. Cost Model: Work and Span. For analyzing the efficiency and performance of multithreaded programs, we use several cost measures, the most important of which are work and span. Definition: Execution Schedule.

The length of a schedule is the number of steps in the schedule. An example schedule with 3 processes. Fact: Scheduling Invariant. Scheduling Lower Bounds Theorem: Lower bounds. Offline Scheduling Having established a lower bound, we now move on to establish an upper bound for the offline scheduling problem , where we are given a dag and wish to find an execution schedule that minimizes the run time. Theorem:[Offline level-by-level schedule].


Theorem: Offline Greedy Schedule. Online Scheduling In offline scheduling, we are given a dag and are interested in finding a schedule with minimal length. Typical Online Scheduling Algorithm. Exercise: Scheduling Invariant. Convince yourself that the scheduling invariant holds in online scheduling.


Parallelism versus concurrency Structured multithreading offers important benefits both in terms of efficiency and expressiveness.


From the past terms such as "sequential programming" and "parallel programming" are still with us, and we should try to get rid of them, for they are a great source of confusion. They date from the period that it was the purpose of our programs to instruct our machines, now it is the purpose of the machines to execute our programs. Whether the machine does so sequentially, one thing at a time, or with considerable amount of concurrency, is a matter of implementation, and should not be regarded as a property of the programming language.

Chapter: Fork-join parallelism Fork-join parallelism, a fundamental model in parallel computing, dates back to and has since been widely used in parallel computing. Parallel Fibonacci Now, we have all the tools we need to describe our first parallel code: the recursive Fibonacci function. Incrementing an array, in parallel Suppose that we wish to map an array to another by incrementing each element by one.

The sequential elision: In the Fibonacci example, we started with a sequential algorithm and derived a parallel algorithm by annotating independent functions. Executing fork-join algorithms: We defined fork-join programs as a subclass of multithreaded programs. Figure 3. Example 7. Centralized scheduler illustrated: the state of the queue and the dag after step 4. Completed vertices are drawn shaded in grey.

Example 8. Distributed scheduler illustrated: the state of the queue and the dag after step 4. Chapter: Structured Parallelism with Async-Finish The "async-finish" approach offers another mechanism for structured parallelism.