Guide to Computer Architecture and System Design--PARALLEL PROCESSING AND FAULT-TOLERANCE (part 1)



With most sequential computers adopting the John von Neumann stored-program concept, the organization follows the SISD (single instruction, single data) scheme.

To improve processor utilization, pipelining and additional CPU-to-storage paths produce a smoother instruction flow. But the need for shorter run times of a single job, and for higher throughput in multitasking on a single CPU, called for virtual memory management and resource-sharing techniques. Even after sensitive program spots are identified for vectorization, the scalar type of instruction can never be eliminated from a program. Thus the dataflow approach to parallel activity, on a processor unit containing many replicated functional units (as in SIMD), paves the way for real speedup beyond the enhanced pipelined approach. This technique of reducing job turnaround times involves a new concept of data flowing across the machine.

The data flow model of this kind is explained in section 2.

In situations where the targeted turnaround times can only be achieved with multiple processing units (CPUs) of the same or varying structures, the MIMD system domain has to form a tightly coupled compute system meeting the dual aspects of dataflow and distributed processing. This type of environment needs parallel language constructs, semaphores for collision detection in resource sharing, and multibus protocols with a good connectivity base. Transputer technology has already gained momentum in this particular area of parallel machines towards the supercomputing end. In most establishments (companies, workshops, academic institutions and the like) real-time data processing has to be done in a distributed way, with a functional approach at each node. This essentially calls for computer networks that are highly localized or channelized for operations; a loosely coupled system (LCS) approach to message passing, information sharing and decision making results in powerful database management, meaning a well-maintained system that does not go insecure. Distance also plays a major role in determining the suitable type of network: local area (LAN), wide area (WAN) and long-haul networks. The aim here is not to focus on the design of computer network architecture, which is by itself a subject of its own.

A graph G(V, E) consists of n vertices (nodes), the computers, and m edges (links), where n and m are integers. The values assigned to n and m in a network decide the connection topology: star, ring, tree, complete, intersecting or irregular.

A network can be defined as a set of autonomous computers that cooperate during their processes for coexistence. "In any graph, the number of vertices of odd degree is even." A graph is completely connected if the number of edges m equals n(n-1)/2, where n is the total number of vertices; this implies a direct link across any two nodes. But in real applications like telephone networks, one cannot assure complete connectivity, and only linking junctions provide a way to real traffic control and a good grade of service. This paves the way for routing in networks and related issues, like message and packet switching.
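The two graph facts above can be checked directly in code. This is an illustrative sketch using a simple edge-list representation (the function names and the sample graph are my own, not from the text):

```python
def degrees(edges, n):
    """Return the degree of each of the n vertices (0..n-1)."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def odd_degree_count(edges, n):
    """'In any graph, the number of vertices of odd degree is even.'"""
    return sum(1 for d in degrees(edges, n) if d % 2 == 1)

def is_complete(edges, n):
    """A complete graph on n vertices has exactly n*(n-1)//2 distinct
    edges: a direct link across any two nodes."""
    return len(set(map(frozenset, edges))) == n * (n - 1) // 2

# A complete graph on 4 vertices has 4*3/2 = 6 edges:
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

Removing any single edge from k4 breaks completeness, which is why real networks fall back on routing through junction nodes instead.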

Dijkstra's shortest-path finder, Floyd's method and other evolving strategies for planning effective optimization algorithms with good performance figures remain, in the realm of software engineering, a vast open area for researchers with their respective objectives.
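As one concrete instance, Dijkstra's algorithm can be sketched in a few lines over a weighted adjacency dictionary. The topology below is illustrative only:

```python
import heapq

def dijkstra(adj, src):
    """Dijkstra's shortest-path finder over a weighted adjacency dict
    {node: [(neighbour, weight), ...]} with non-negative weights."""
    dist = {src: 0}
    pq = [(0, src)]                      # priority queue of (distance, node)
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                     # stale queue entry, skip it
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd             # found a shorter route to v
                heapq.heappush(pq, (nd, v))
    return dist

# A small routing example: going A -> C -> B (cost 3) beats A -> B (cost 4).
net = {"A": [("B", 4), ("C", 1)], "C": [("B", 2)], "B": []}
shortest = dijkstra(net, "A")
```

This is exactly the kind of junction-based routing decision a switched network must make when complete connectivity is unaffordable.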

One example is the use of SQL (Structured Query Language) in database management, which is at present drawing competitive spirit at the network layers. Pattern matching and image identification are innovative applications of graph theory that have already found a place in medical diagnosis, chemical analysis and the forensic sciences. The powerful use of graphics workstations across the globe has assisted a great deal in weather forecasting and the television entertainment spheres. MIMD (multiple instruction, multiple data) draws a good amount of attention among architects.


FIG. 1 Instructions overlapped fashion: Instruction Fetch -- Decode -- Operands Fetch -- Execute


Floating-point arithmetic in mantissa-exponent notation (employing a carry-save approach) blends well into an arithmetic pipeline for matching memory bandwidths, and thus helps in smooth dataflow. FIG. 1 gives the usual instruction pipeline on SISD computers. MIMD machines, capable of executing several independent programs simultaneously, are aimed at throughput, reliability, flexibility and availability. They employ multiport memories with time-shared buses and crossbar switching.

FIG. 2 depicts control versus speedup under MIMD protocols. It is inferred that a small number of fast processors meets the objectives better than a large number of slower processors, which only increases the synchronization overheads. The program segments may be visited many times by the same operand set, and collisions must be avoided by a fine process strategy. In multiprocessing, the minimum time that must elapse between the initiations of any two processes is called the latency.

FIG. 2 MIMD Speed-up

The pipelined multiprocessor system is the state of the art in parallel processing design. These von Neumann models are rather termed control-flow machines.

FIG. 3 Data flow graph.

The dataflow concept is applied to maximize parallelism, and instruction initiation depends on data availability. Program instructions can be represented by dataflow graphs, as shown in the FIG. 3 example.

Data tokens carry operand values to their destinations, and control tokens act as instruction activators.

The processes y + 1 and y - z can proceed in parallel to prepare the final result x = (t1 * t2), as in FIG. 3, under a dataflow approach. The basic dataflow machine of FIG. 4 has moved into the VLSI microelectronics area.
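The firing rule behind FIG. 3 can be sketched as a tiny data-driven interpreter: an instruction fires as soon as both of its operand tokens are present, so t1 and t2 fire in the same step, before x. The node names and dictionary encoding are illustrative:

```python
import operator

# FIG. 3 as a dataflow program: t1 = y + 1, t2 = y - z, x = t1 * t2.
PROGRAM = {
    "t1": (operator.add, "y", "one"),
    "t2": (operator.sub, "y", "z"),
    "x":  (operator.mul, "t1", "t2"),
}

def run(tokens):
    tokens = dict(tokens)
    fired = set()
    while len(fired) < len(PROGRAM):
        # Every node whose operands hold tokens is enabled; in one
        # "time step" all enabled nodes fire (t1 and t2 in parallel).
        enabled = [n for n, (_, a, b) in PROGRAM.items()
                   if n not in fired and a in tokens and b in tokens]
        for n in enabled:
            op, a, b = PROGRAM[n]
            tokens[n] = op(tokens[a], tokens[b])
            fired.add(n)
    return tokens

result = run({"y": 5, "z": 2, "one": 1})   # x = (5+1) * (5-2)
```

Note that no program counter appears anywhere: execution order emerges purely from data availability, which is the essential contrast with the control-flow machines above.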

FIG. 4 Basic dataflow mechanism

ADT (abstract data type) is a mathematical model together with various operations defined on the model.
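A minimal illustration of the ADT idea: the model (a sequence) stays hidden, and only the defined operations are exposed. This stack sketch is my own example, not from the text:

```python
class Stack:
    """Stack ADT: the mathematical model is a sequence; push, pop and
    empty are the operations defined on that model.  Callers never
    touch the representation directly."""

    def __init__(self):
        self._items = []              # hidden representation

    def push(self, x):
        self._items.append(x)

    def pop(self):
        return self._items.pop()      # raises IndexError when empty

    def empty(self):
        return not self._items

s = Stack()
s.push(1)
s.push(2)
```

Because only the operations are public, the representation could be swapped (say, for a linked list) without changing any client code.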

In a static dataflow model, only one token is allowed to exist on any arc at any given time; otherwise successive sets of tokens cannot be distinguished. The Jack Dennis group at MIT pioneered the static model, while Arvind's machine and the Manchester machine configured a dynamic dataflow approach employing tagged tokens on a pipelined ring structure, as in FIG. 5.

FIG. 5 Ring structured data flow organization

The ID (Irvine Dataflow) and VAL (Value Algorithmic Language) designs are compatible with the use of dependence graphs for compilation on conventional computers.

The major design issues in data flow computing (DFC) are: language implementation (the single-assignment rule); decomposition of programs, which is difficult; developing intelligent data-driven mechanisms; efficient handling of complex data structures such as arrays; memory hierarchy and allocation; software support for compilation; and performance evaluation.


DFC assumes highly concurrent operation, matching VLSI and VHSIC technology and increasing programming productivity for vectorization.


Data-driven control at the instruction level causes excessive pipeline overhead per instruction, which may destroy the benefits of parallelism. Dataflow programs tend to waste memory space through increased code length (a consequence of the single-assignment rule) and excessive copying of data arrays. Packet switching becomes cost-prohibitive with many processors.

A few of the experimental dataflow machines use a tree-structured architecture, with each computing element supported by offspring elements to aid compilation, simulation and performance-measurement programs.

As a design alternative for DFC, the dependence-driven approach suggested by Motooka et al. (1981) and Gajski et al. (1982) aims to raise parallelism to the compound-function level at run time. This includes array operations, linear recurrences and pipelined loops.

The multilevel event-driven approach shown in FIG. 6 is given by Hwang and Su, where an event is a logical activity defined at the job level. Today's supercomputers look to dataflow as a viable approach to improve performance and satisfy aspiring clients. Many compute-bound problems and much real-time processing call for the same sequence of instructions acting on different datasets.

A good example is the minimization of multiple single-output functions, which needs array processing in the realm of computer-aided minimization: multiple functional units with their own local memories working under a central supervisory control unit. Another example is real-time transactions (a railway reservation system), with a shared database memory and concurrent handling of data files in a tightly coupled, demand-based secure system. Voluminous data processing for sorting and search activities again invokes the divide-and-conquer rule for optimization. Many numerical solutions for image processing involving Fourier transforms, various types of recursion and iterative methods repeated on varied data values, and models involving statistical techniques for simulation and performance criteria, again call for a parallel-processing workbench. On these specific and unique problem areas, the most sought-after data processing is a combination of SIMD and SISD. This trend towards SIMD is also known as vectorization, for increased throughput towards speedup. Scaling up functional processing elements with an embedded flavour of parallel working is a selective domain of computing architectures.
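The "same sequence of instructions acting on different datasets" idea can be sketched in miniature: one scalar kernel broadcast across many data lanes. This is a pure-Python illustration of the SIMD flavour, not a real vector implementation:

```python
def kernel(x):
    """The single instruction sequence every 'lane' executes."""
    return 2 * x + 1

def simd_apply(data):
    """Apply the one kernel to every element, as a SIMD control unit
    would broadcast one instruction stream to all processing elements.
    Each element is an independent lane with no cross-dependence."""
    return [kernel(x) for x in data]

lanes = simd_apply([0, 1, 2, 3])
```

Because the lanes are independent, a vector machine can execute them all in lock-step in a single pass, which is the speedup vectorization buys.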

The dataflow diagrams assist MIMD actions on a tightly scheduled processor for effective processor utilization. Real MIMD implies a multiprocessor configuration (of the same or different processor types), attached to the transputer domain and to distributed networks. The rarer MISD (multiple instruction, single data) combination fits applications like medical diagnosis, where logic processing with the Prolog and Lisp languages paves the convergence to expert systems for quality, quantifiable decisions. Parallel processing is also realized using the associative-memory concept in dedicated search applications, with little hardware overhead (match registers and comparator circuits); higher power dissipation and the pin/circuit ratio are the major difficulties encountered. Arithmetic pipelining of computational tasks with concurrency achieves speeds of the order of 100 Mflops with reasonable precision on supercomputers descended from the CDC 6600. The Unix operating system exploited the cream of the pipelined approach for interactive machines.

FIG. 6 Machine with structured control hardware


A single control unit fetches and decodes instructions. The processing elements (PEs) operate synchronously, but their local memories have different contents. The distinction between the pipeline categories, i.e., vector or array processors, is attributed to factors like the complexity of the control unit, the processing power and addressing methods of the PEs, and the interconnection among the processing elements. In array processors, the PEs communicate with their neighbors through a network switch and are suited to vector processing. The design complexity is felt when covering a wider range of applications. FIG. 7 shows the example of a SIMD subsystem, the Illiac IV, which is primarily meant for matrix manipulation and solving partial differential equations. Since it is not meant for scalar mode, the basic software is hosted by a satellite computer and the Illiac IV becomes a powerful special attachment, as shown in FIG. 8.

FIG. 7 Illiac IV for SIMD

FIG. 8 Illiac IV attachment

Glypnir (an ALGOL-based language) and the Fortran-based CFD serve as language domains. The difficulties in producing efficient compiled code have slowed new language developments, and the burden of designing efficient algorithms, with the implied knowledge of the CPU architecture, is left to the user. "Turnaround time is defined to be the time elapsed from when a task is initiated until it is completed in the area of parallel processing", and it varies over a broad range of values. Handler has proposed a scheme for representing a computer. Each system is denoted by a triplet C = (K, D, W), where K is the number of control units, D is the number of ALUs controlled by each unit, and W is the word length of the entities managed by the D ALUs.

As example systems:

IBM System/370 = (1, 1, 32),

CDC 6600 = (1, 1, 60),

IBM/360 dual processor = (2, 1, 32),

Illiac IV = (1, 64, 64) and

C.mmp = (16, 1, 16).
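Handler's triplets lend themselves to a table in code. The triplet values are taken from the list above; the "bits processed in parallel" figure of merit is my own illustrative derivation, not part of Handler's scheme:

```python
# Handler's triplet C = (K, D, W): K control units, D ALUs per control
# unit, W the word length handled by each ALU.
SYSTEMS = {
    "IBM System/370":         (1, 1, 32),
    "CDC 6600":               (1, 1, 60),
    "IBM/360 dual processor": (2, 1, 32),
    "Illiac IV":              (1, 64, 64),
    "C.mmp":                  (16, 1, 16),
}

def bit_parallelism(name):
    """A crude figure of merit: K * D * W bits handled concurrently."""
    k, d, w = SYSTEMS[name]
    return k * d * w
```

By this measure the Illiac IV's 64 ALUs of 64-bit width dwarf the uniprocessors, while C.mmp gains its parallelism from 16 separate control units instead.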

The goal of scheduling algorithms is to keep a maximal rate of initiations while avoiding collisions. Latency, the interval of time between successive initiations, can be used as a performance measure contributing to throughput rate. I/O processors (called communication processors) can be a desirable component of a tightly coupled system of interactive users, in addition to database management and sharing strategies with distributed processing trends. DSP algorithms need hardware multipliers and adders in arithmetic processors; negative numbers are handled by fractional two's-complement representation.

A functionally divided multiprocessor system is designed primarily as a network of tasks that communicate through data objects resident in common memory. RISC architectures matching I/O bandwidth towards MIMD invite process networking and dataflow. Dataflow machines involving asynchronous and stochastic processing support parallel environments, including neural computing. The use of semaphores and preemption techniques can provide deadlock avoidance in multiprocessing trends. Highly parallel processing platforms may use the multibus concept with reliable memories. The major issues of parallel systems involve algorithm design for potential concurrency, the inclusion of language features to explicitly express parallelism, the synchronization problem among cooperating and competing processors, and scheduling and load-balancing techniques.

The transfer speed is determined largely by the number of software synchronization points. The ratio of the bit rates for the three methods, programmed I/O : interrupt-controlled I/O : direct memory access, is approximately 1 : 2 : 45. The use of bit-slice microprocessors offers higher speed, ease of expansion and flexibility.

A programmable logic array (PLA) is functionally equivalent to two levels of AND-OR logic gates.

A mask-programmed PLA is programmed once and for all by the manufacturer.
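The two-level AND-OR structure can be mirrored directly in code: an AND plane forms product terms, and an OR plane sums them. The particular function below is an illustrative choice, not from a real device:

```python
def pla(a, b, c):
    """f = a.b + b'.c  realized as an AND plane feeding an OR plane.
    Inputs are booleans (or 0/1); output is a boolean."""
    products = [a and b, (not b) and c]   # AND plane: product terms
    return any(products)                  # OR plane: sum of products

truth_row = pla(True, True, False)
```

Programming a mask PLA amounts to fixing, at manufacture, which literals feed each AND row and which products feed each OR column.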

A Petri net is a tool that provides a systematic, concise and easily comprehensible description of a sequential system such as that shown in FIG. 9. Live and safe Petri nets that are not state machines are used to model systems presenting parallelism.

FIG. 9 A structured system
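The Petri-net firing rule is compact enough to sketch: a transition is enabled when every input place holds a token; firing consumes those tokens and deposits tokens on the output places. The small fork/join net below is my own illustration of how such nets model parallelism:

```python
# transition name -> (input places, output places)
NET = {
    "t1": (["p1"], ["p2", "p3"]),   # fork: one token splits into two paths
    "t2": (["p2", "p3"], ["p4"]),   # join: synchronizes both paths
}

def fire(marking, t):
    """Fire transition t on a marking {place: token count}."""
    ins, outs = NET[t]
    if not all(marking.get(p, 0) > 0 for p in ins):
        raise ValueError(f"{t} is not enabled")
    m = dict(marking)
    for p in ins:
        m[p] -= 1                    # consume input tokens
    for p in outs:
        m[p] = m.get(p, 0) + 1       # produce output tokens
    return m

m = fire({"p1": 1}, "t1")   # both p2 and p3 marked: parallel activity
m = fire(m, "t2")           # join fires only once both arrive
```

The join transition t2 is exactly the synchronization point that distinguishes such nets from plain state machines: it cannot fire until both parallel branches have delivered their tokens.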



