Guide to Computer Architecture and System Design--SINGLE CPU MACHINES AND PIPELINING (part 1)


A program comprises data and instructions that must be presented to the machine by means of a language. If the language is command-based, the system simply executes executable files that already reside permanently on the system. Otherwise, compilation, interpretation or assembly becomes the main task of the CPU, generating the .obj files that can then be linked for execution on the machine. Strictly speaking, this sequential organization is mandatory both for stored programs and for programs in the process of being stored. The stored program concept is therefore obeyed perfectly on a powerful batch processing machine. In an interactive environment, where users spend their time with the machine on-line, users have to be more alert to both the command and the communication aspects of programming languages. On a time-shared system, the interactions of user activities are further monitored by the active operating system. It is thus clear that a CPU has to cater to varying categories of system configuration, and CPUs are likely to be identified with one type of workbench. As computer networks develop, the total system may have to take care of all types of workstations, which calls for a real implementation of parallel processing.

Pipelining is an inherent character of CPUs, used to optimize the use of valuable processing, storage and device resources. In this context the transition from 8 to 16 bit processors can be considered a revolution in the evolution of CPUs.


In this section the CPU activity is analyzed at system level. Each program consists of a set of instructions (.exe files) to be executed to produce the solution outputs.

The CPU contains a large number of arithmetic and logic elements with adequate storage. First and foremost, instruction pipelining in terms of data preparation is a continuous activity on single CPU machines. Consider Fig. 1, which depicts the flow of an instruction cycle. This involves both external and internal activities in terms of data fetching. Even in a simple sequential machine, blocks (2) and (3) can proceed in parallel, besides allowing (1) to happen for instruction look-ahead. In fact, the 8086 CPU maintains a 6 byte queue inside the microprocessor. In essence, the instructions that fit best at any point in time are attended to simultaneously. The machine cycle encoding reflects the micro-operations that have to be carried out in parallel within an instruction time. Microprogramming is an attractive approach towards a complete program look-ahead view, in order to accommodate relocating abilities within the machine's capability. The additional components that help to exploit instruction pipelining are a fairly large cache store, separate address and data buses, good stack orientation, and algorithms and data structures. The last point is the user's responsibility, but it is nevertheless the only quantity that professional programmers can control on a machine for deterministic outputs.

The effect of this pipelining is to increase the throughput of the system at run time. It can be quantified as follows:

Assuming a program p contains n instructions i1, i2, ..., in, having instruction times t1, t2, ..., tn, the ideal program time on a single CPU that does not incorporate any pipelining will be


$$T = \sum_{i=1}^{n} t_i P_i = t_1 P_1 + t_2 P_2 + \dots + t_n P_n$$

where Pi is the number of times the same instruction is used in the program. This gives the worst case measure. With the embedded parallelism and the additional factors mentioned above, the program time can fall to as low as 20% of this value, which corresponds to up to a 5 fold increase in throughput for single CPU architectures of the one-to-one category (personal computers being an example). Today, however, time has become so precious that no user will be able to spend more than 4% of his time on-line. Secondly, it becomes essential for a working group to achieve targets within specified time limits. These factors have created the need for time-shared systems, where users can also spend useful time in a competitive spirit. In timesharing machines, the CPU is essentially shared, and the other resources, i.e. cache stores, software memory and user modules, are sharable. Instruction pipelining therefore gains even more credit in providing utility services on a network where distributed computing takes shape.
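The program-time sum and the 20% figure can be tried out numerically. The following is a minimal sketch in Python; the function names and the sample instruction times and counts are made up for illustration and do not come from any particular machine.

```python
def program_time(instr_times, use_counts):
    """Ideal non-pipelined program time: the sum of t_i * P_i."""
    if len(instr_times) != len(use_counts):
        raise ValueError("need one use count per instruction time")
    return sum(t * p for t, p in zip(instr_times, use_counts))


def throughput_gain(unpipelined_time, pipelined_time):
    """Ratio of run times; 5.0 means the program finishes five times sooner."""
    return unpipelined_time / pipelined_time


# Hypothetical figures: three instruction types, times in microseconds.
times = [2.0, 4.0, 6.0]
counts = [10, 5, 1]

worst_case = program_time(times, counts)   # 2*10 + 4*5 + 6*1
best_case = 0.20 * worst_case              # the 20% figure quoted in the text
print(worst_case, throughput_gain(worst_case, best_case))
```

The gain is simply the reciprocal of the fraction of the worst-case time that remains, so a program time of 20% corresponds to a gain of about 5.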

Fig. 1 Instruction flow

Fig. 2 Compute pipeline architecture

Fig. 2 represents the compute pipelining (both arithmetic and logical) for speedup and decision-making, which can also be embedded in a CPU. As shown in Fig. 2, assuming that independent blocks for each arithmetic operation and powerful parallel logic routing are available, the decoded instructions are mapped directly onto a static microprogram held in a ROM, which is used dynamically to generate and control the control activity. Apart from the registers specific to a CPU, the buffer RAM provides cushion space for visitors (both operands and operator scheduling) for a smooth flow on a well pipelined architecture. This pipelining of compute units at execute time offers the following merits:

• good vectorization ratio (data parallelism, operands);

• processor utility is improved;

• good user algorithms for arithmetic pipelines can be invented and added to the compute strength;

• computer aided minimization of computer design algorithms can be implemented, said to follow SIMD topology; and in parallel,

• fast I/O activities can be performed.

Achieving the above merits by way of pipelining depends on:

i) the number of pipe segments

ii) the length of each pipe, i.e., time spent on a subprocess

iii) adequate fast cache storage for instruction look-ahead and

iv) good virtual memory management schemes for time sharing machines on a multiuser uni-program environment.
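The influence of the number of segments and the time per segment can be sketched with the usual timing model for a linear pipeline. This model is not given in the text and is assumed here: every segment takes the same time, there are no stalls, and a new task enters the pipe as soon as the first segment is free. The function names are illustrative.

```python
def pipeline_time(n_tasks, n_segments, segment_time):
    """Time for n_tasks to pass through a linear pipeline of equal segments.

    The first task needs n_segments steps to fill the pipe; each later
    task completes one step after the previous one.
    """
    if n_tasks == 0:
        return 0.0
    return (n_segments + n_tasks - 1) * segment_time


def serial_time(n_tasks, n_segments, segment_time):
    """Time when each task runs through all segments before the next starts."""
    return n_tasks * n_segments * segment_time


def speedup(n_tasks, n_segments, segment_time):
    """Serial time divided by pipelined time for the same work."""
    return (serial_time(n_tasks, n_segments, segment_time)
            / pipeline_time(n_tasks, n_segments, segment_time))


# Hypothetical case: 100 tasks, 4 segments of 1 time unit each.
print(pipeline_time(100, 4, 1.0), serial_time(100, 4, 1.0), speedup(100, 4, 1.0))
```

Under these assumptions the speedup approaches the number of segments as the number of tasks grows, which is why long, uniform streams of work suit a pipelined machine best.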

It is clear from the above that pipelined computers are inherently suited to:

i) heavy demands from data processing and compute-bound users in a batch process;

ii) an interactive environment for group users, in a distributed computing sense, on co-operative workbenches.

Thus, the design of the operating system is important in meeting the dual demands of turnaround time and throughput, which more often than not do not go together. Hence there is a need for a communication processor (which IBM calls I/O channels), where dynamic job scheduling is provided for an already available static pipelined hardware. There is thus a lot of scope for system engineers (hardware implementation) to gain from research and development in the testing phase towards applying the technology effectively.


Defn: Computer architecture can be defined as "the inter-relation and interaction between the static hardware and the dynamic software facility embedded onto a system, with the applications group counting the throughput of the machine in their own faculties/communities". Fig. 3 (a) gives the 8085 layout as far as user hardware design is concerned. In general, a CPU on its own is just a soul without the body. Hence whatever is connected externally to the 8085 CPU is what makes up the system set, much "like X + X' = 1". Essentially, the 8085 has a word length of 8 bits, carried on the bidirectional data bus AD7-AD0, and has an addressing capacity of 64 K bytes of memory (1 K byte = 1024 bytes) and device addressing of up to 256 devices. Being a sequential processor, the following features are noteworthy:

1. The 8085 CPU makes use of a multiplexed address/data bus;

2. Employs I/O mapped I/O with dedicated input/output instructions for process instrumentation;

3. Provides serial communication;

4. Maintains hardware interrupt levels, with priority followed;

5. Has software interrupt instructions;

6. Provides direct memory access with an additional DMA chip (8257) for fast and large I/O transfers;

7. Has a clock speed of 2 to 3 MHz;

8. Has instruction lengths of 1, 2 and 3 bytes;

9. Accommodates simple and easy addressing modes;

10. Includes powerful stack instructions for embedded pipelining;

11. Gives user flexibility in I/O design for selective applications;

12. Has a powerful logical instruction set;

13. Has an effective assembler towards a microprocessor development system;

14. Has a monitor ROM for machine level decoding in basic 8085 systems.

Inside the CPU, with reference to Fig. 3 (b):

1. The microprocessor has a good register set A, B, C, D, E, H, L;

2. The active accumulator register, more often than not serving as space for the input operand as well as the delivered result;

3. A set of five flags indicating the process status for powerful actions;

4. The PC (program counter) register as an instruction pointer for a program in execution;

5. A stack pointer to keep track of both the system stack and the user stack area;

6. Temporary registers W and Z for scratchpad work in an opaque manner;

7. A powerful ALU with limited arithmetic capability;

8. Address/data buffers and a multiplexer to reflect the inherent parallelism of the machine;

9. The instruction register (8 bits) for usual opcode fetching, which employs the same width for ease of decoding and encoding operations.

Hence out of the 8 bit family of microprocessors, the 8085 can be put to optimum use with the particular chosen environment.

Fig. 3 (a) CPU 8085 outer

Fig. 3 (b) CPU 8085 by Intel

16 bit Microprocessors

The 8086 from Intel is discussed in what follows.

Fig. 4 (a) depicts the external pinouts of the 8086.

This is upward compatible with the 8085, meeting the software set of the 8085 for byte operands besides supporting 16 bit operations by means of the BHE signal shown in Fig. 4 (a). It can address up to 1024 kilo (1 mega) bytes of memory (using a multiplexed address/data bus) and has a larger number of internal registers meant for memory management at run time for higher throughput. It includes 9 flags for process status.
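The memory and device capacities quoted for the 8085 and the 8086 follow directly from the number of address bits. The short sketch below makes the arithmetic explicit; the figure of 20 address lines for the 8086 is not stated in the text and is assumed here because it is what a 1 megabyte space implies.

```python
def addressable_units(address_bits):
    """Number of distinct addresses selectable with the given number of bits."""
    if address_bits < 0:
        raise ValueError("address_bits must be non-negative")
    return 1 << address_bits


K = 1024  # 1 K byte = 1024 bytes, as defined in the text

# Bit widths: 16 and 8 for the 8085 (memory, I/O); 20 assumed for the 8086.
for name, bits in [("8085 memory", 16), ("8085 I/O devices", 8), ("8086 memory", 20)]:
    n = addressable_units(bits)
    print(f"{name}: {bits} address bits -> {n} locations ({n / K:g} K)")
```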

It supports a good mix of addressing schemes and maintains a 6 byte instruction queue within the CPU, as shown in Fig. 4 (b). It has direct multiply instructions to support arithmetic. Additionally, the 8086 supports string operations on character data. The IN and OUT instructions provide the I/O mapped I/O configuration.

The design of the system itself incorporates the bus interface function and the powerful execute logic. It embeds instruction pipelining, which is essentially a link to data concurrency.

The 8086 accommodates a good exception handling facility supporting external interrupts; internal interrupts (divide by zero); and single step (tracing at instruction level).

Fig. 4 (a) External configuration of 8086 CPU & (b) Features for pipelining

It runs at a clock speed of 4 MHz to 8 MHz. These features, and enhanced versions from Intel, made the 16 bit processors fit for the personal computer class as well.

The LOCK instruction can be employed when the 8086 is used in MAX mode; it lets the bus arbitration unit keep the system bus granted to the locking processor while denying bus requests from other processors. See Fig. 4 (c). The main application of this facility is in distributed digital instrumentation systems, where response time is a critical parameter for an effective on-line real-time process. This exclusively touches the domain of ASICs (application specific integrated circuits). The 8086 CPU can be used in MIN or MAX mode.

Most programs have a good deal of scalar code (about 20%) on a sequential Von Neumann machine. Thus, good performance cannot be obtained unless the scalar operations are speeded up in addition to vectorization.
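The effect of the scalar portion can be put into numbers with the customary two-mode speedup formula. The sketch below is an illustration under stated assumptions: the function name is made up, the scalar fraction runs at the original speed, and only the remaining fraction is accelerated by the factor `vector_speedup`.

```python
def overall_speedup(scalar_fraction, vector_speedup):
    """Overall speedup when only the non-scalar part of a program is accelerated.

    scalar_fraction: share of run time spent in scalar code (0.0 to 1.0).
    vector_speedup:  factor by which the remaining code is sped up.
    """
    if not 0.0 <= scalar_fraction <= 1.0:
        raise ValueError("scalar_fraction must lie between 0 and 1")
    if vector_speedup <= 0:
        raise ValueError("vector_speedup must be positive")
    return 1.0 / (scalar_fraction + (1.0 - scalar_fraction) / vector_speedup)


# With 20% scalar code, even very fast vector hardware gives less than 5x.
for factor in (2, 10, 100, 1000):
    print(factor, round(overall_speedup(0.20, factor), 3))
```

However large the vector factor becomes, the result stays below 1 / 0.20 = 5, so the slow scalar mode sets the ceiling.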

Fig. 4 (c) Convolution efficiency for different network sizes

This is commonly known as Amdahl's law, stated as:

"When a CPU has two distinct modes of operation, a high speed mode and a low speed mode, the overall speed is dominated by the low speed mode unless the low speed mode can be totally eliminated."

Zilog Z-80 Microprocessor features:

• Uses N-MOS technology

• Clock-speed 2.5MHz to 6 MHz

• Separate 16 address and 8 data lines.

• Has only two interrupt lines and has 158 basic instructions

• Operates on + 5V power supply.

• Provides more addressing modes having index registers.

LDIR (load, increment and repeat)

DJNZ (decrement B and jump if non zero)

CPIR (compare, increment and repeat)

• Includes bit manipulation in registers and memory.

• Good set of I/O instructions.

• Z80 is supported by parallel I/O, the clock timer, DMA and serial I/O chips.

Fig. 5 (a) and (b) depict the pinouts and internal contents of the Z-80 CPU respectively.

Motorola conducts extensive reliability tests to qualify devices, to evaluate process and material changes and to accumulate generic performance data. The results of these tests provide the basis for production decisions and the generation of reliability reports for customer use.

Fig. 5 (a) Z-80 CPU Pinouts

Fig. 5 (b) CPU Z-80 registers

Fig. 6 Motorola MC 6800 CPU

Reliability testing performed by the Motorola MOS µP (microprocessor) division during the last decade has produced excellent results.

Fig. 6 depicts the Motorola MC 6800 microprocessor.

Features of 6800 include:

• 72 basic instructions;

• Makes extensive use of memory referencing;

• Employs memory mapped I/O;

• Clocks at 1 MHz and uses separate address and data buses;

• Employs N-MOS technology.

The address output lines of the 6800 microprocessor can sink 2 mA in the logical 0 state and source 150 µA in the logical 1 state. The VMA signal generated by the 6800 stands for valid memory address. All system transfers are treated as memory transfers; thus, the 6800 follows memory mapped input/output. The 6800 CPU employs the Φ2 clock signal in the generation of the system control bus. Activity in the system takes place on the falling edge of the Φ2 clock (1 µs). To slow down the 6800 so that it can access slower memory or I/O devices, the Φ2 pulse width is stretched for the length of time required to access a particular memory location. The pulse width can be extended to a maximum of 4.5 microseconds. By slowing down the microprocessor in this way, enough time can be allowed to interface to slower memories without adding any hardware.

Fig. 7 Motorola 68000 power

MC 6801 is an 8-bit single-chip microcomputer unit (MCU) which significantly enhances the capabilities of the M6800 family of parts.

It includes an upgraded M6800 MPU with upward source and object code compatibility. Execution times of key instructions have been improved and several new instructions have been added, including unsigned multiply. It has a 64 K byte address space. It is TTL compatible and requires one +5 V DC power supply.

On-chip resources include 2048 bytes of ROM, 128 bytes of RAM, a serial communications interface, parallel I/O and a three-function programmable timer.

Software features of the 6801 include 8 bit multiply, five flags with PC relative addressing, an internal clock with a divide by 4 output (8 MHz maximum) and a parallel I/O facility.

An EPROM version of the MC 6801, the MC 68701 microcomputer, is available for systems development.

The MC 68000 has a 16 bit data bus and a 23 bit address bus. Words begin only at even byte addresses. Memory can be viewed logically as a linear array of 16 megabytes.

It has a powerful register set (32 bits wide), as shown in Fig. 7, with compact addressing modes including relative addressing, and a separate user stack pointer. It provides three hardware interrupt priority lines (IPL0 to IPL2). The 68000 has a powerful ALU (32 bits) with more of a sequential approach.

Together with the M68451 (memory management unit), it can be used to implement multiprogramming features.

Particular attention is required to the problem of resource allocation on a multi-user system. Resource sharing is one aspect on single CPU machines which must be done precisely. In 68000, the semaphore concept can be utilized by making use of the TAS instruction. In addition, this CPU provides a certain protection to users.
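The semaphore idea behind an instruction such as TAS can be sketched in software. The following Python model is an illustration only, not 68000 code: the class `TasFlag` and the functions `worker` and `run_demo` are hypothetical names, and a `threading.Lock` stands in for the indivisible read-modify-write bus cycle that the hardware provides.

```python
import threading
import time


class TasFlag:
    """Software stand-in for a memory byte operated on by test-and-set."""

    def __init__(self):
        self._set = False
        self._atomic = threading.Lock()  # models the indivisible bus cycle

    def test_and_set(self):
        """Return the old value and leave the flag set, as one atomic step."""
        with self._atomic:
            old = self._set
            self._set = True
            return old

    def clear(self):
        """Release the resource by clearing the flag."""
        with self._atomic:
            self._set = False


def worker(flag, shared, rounds):
    """Increment a shared counter, guarding each update with the flag."""
    for _ in range(rounds):
        while flag.test_and_set():   # old value set: someone else holds it
            time.sleep(0)            # yield and try again
        shared["count"] += 1         # critical section
        flag.clear()


def run_demo(n_threads=4, rounds=200):
    """Run several workers and return the final counter value."""
    flag = TasFlag()
    shared = {"count": 0}
    threads = [threading.Thread(target=worker, args=(flag, shared, rounds))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]


print(run_demo())
```

Because only the caller that sees the old value clear may enter the critical section, every increment is applied exactly once and the final count equals the number of threads times the number of rounds.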

With these 16 bit microprocessors, it is indeed possible to employ more CPUs in order to increase and improve the throughput with resource sharing concept.

AM 29000 from Advanced Micro Devices is a powerful 32 bit µP said to be RISC.

It has:

• 64 global registers (32 bits wide)

• 128 local registers.

Any of the 192 registers can be used in instructions. It takes 8 bits to specify 1 register address. All instructions are 32 bits long (with more often a single byte opcode).

• Some are stack registers.

• Ease of programming lies with the programmer to optimize the program size and time as per demand.
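The statement that a 32 bit instruction carries a single byte opcode and 8 bit register addresses can be illustrated by packing and unpacking such a word. The field order used below (opcode in the top byte, followed by three register fields) and the sample opcode value are assumptions made for the sketch, not taken from the text.

```python
def encode(opcode, rc, ra, rb):
    """Pack an 8 bit opcode and three 8 bit register numbers into 32 bits.

    Assumed layout, most significant byte first: opcode, rc, ra, rb.
    """
    for value in (opcode, rc, ra, rb):
        if not 0 <= value <= 0xFF:
            raise ValueError("each field must fit in 8 bits")
    return (opcode << 24) | (rc << 16) | (ra << 8) | rb


def decode(word):
    """Split a 32 bit instruction word back into its four 8 bit fields."""
    return ((word >> 24) & 0xFF, (word >> 16) & 0xFF,
            (word >> 8) & 0xFF, word & 0xFF)


# 0x14 is an arbitrary opcode value chosen for the example.
word = encode(0x14, 130, 5, 64)
print(hex(word), decode(word))
```

Eight bits per register field is enough to select any one of the 192 registers, since 192 is less than 256.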


The store registers within a central processing unit possess varying capabilities. Some of the desirable characteristics of register operations include shift, count, clear and load.

The data that is already available as machine code in the main semiconductor memory constitutes the program, with the macro instructions at the machine level giving rise to the subsequent micro-operations.

Most high level language statements are executed largely through data movement involving registers, stacks and memory segments, under strictly synchronous control governed by the system clock. Each macroinstruction comprises one or more machine cycles, which in turn are made up of clock periods. Data movement between memory and registers employs the conventional memory address register (MAR) and memory data register (MDR). Such data transfers can also be conditional, specified by a Boolean variable. Normally letters constitute a register name and a comma separates two micro-operations. An arrow denotes the data movement, clearly indicating the source and destination. The symbols used to denote registers and some micro-operations are given in Table 1.

Table 1.
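The register transfers through MAR and MDR, and a conditional transfer, can be mimicked in a few lines. This is a sketch under assumptions: registers are held in a Python dictionary, memory is a list of bytes, the address width is taken as 16 bits, and the function names are illustrative.

```python
def memory_read(memory, regs, address_reg):
    """Read one memory location using the MAR/MDR pair."""
    regs["MAR"] = regs[address_reg]        # MAR <- (address register)
    regs["MDR"] = memory[regs["MAR"]]      # MDR <- M[MAR]


def fetch(memory, regs):
    """Opcode fetch: read the location addressed by PC into IR, then step PC."""
    memory_read(memory, regs, "PC")
    regs["IR"] = regs["MDR"]                 # IR <- MDR
    regs["PC"] = (regs["PC"] + 1) & 0xFFFF   # PC <- PC + 1 (16 bit wraparound)


def conditional_transfer(regs, condition, dst, src):
    """Transfer dst <- src only when the Boolean control variable is true."""
    if condition:
        regs[dst] = regs[src]


# Hypothetical two byte memory image and cleared registers.
memory = [0x3E, 0x05]
regs = {"PC": 0, "MAR": 0, "MDR": 0, "IR": 0, "A": 0}
fetch(memory, regs)
conditional_transfer(regs, regs["IR"] == 0x3E, "A", "MDR")
print(regs)
```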

At run time, each machine instruction is decoded and the machine cycles are decided, stepping up the program counter. These machine cycles can be carried out wholly serially, or a certain amount of overlapping (pipelining) is possible through the inherent control architecture to optimize the processor resource. Normally fixed length opcodes are preferred, and instruction decoding is done according to the instruction lengths and the capabilities of the processor instruction set. This also leads to the deterministic approach of static microprogramming and fixed finite state machines.

Dynamic RAMs possess the destructive readout characteristic.

Writing the sequence of microinstructions for machine level instructions is known as microprogramming. This makes the hardware design simple. In a microprogrammed design, new instructions can be implemented by writing new sets of microinstructions, which permits emulation.

As examples, the 16 bit processors Intel 8086 and Motorola 68000 employ microprogramming.
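The idea that each machine instruction is a stored sequence of microinstructions, and that new instructions are added by writing new sequences, can be sketched with a small table-driven model. All names here (the instruction mnemonics, the micro-operation strings and the TMP register) are hypothetical and are not the microcode of any real processor.

```python
# Micro-operations available in the data path (8 bit registers assumed).
MICRO_OPS = {
    "A<-B":     lambda r: r.update(A=r["B"]),
    "TMP<-A":   lambda r: r.update(TMP=r["A"]),
    "TMP<-B":   lambda r: r.update(TMP=r["B"]),
    "A<-A+TMP": lambda r: r.update(A=(r["A"] + r["TMP"]) & 0xFF),
    "B<-B+1":   lambda r: r.update(B=(r["B"] + 1) & 0xFF),
}

# Control store: each machine instruction maps to a microprogram.
CONTROL_STORE = {
    "MOV_A_B": ["A<-B"],
    "ADD_B":   ["TMP<-B", "A<-A+TMP"],
    "INR_B":   ["B<-B+1"],
}


def execute(instruction, regs):
    """Run the microprogram stored for one machine instruction."""
    for micro_op in CONTROL_STORE[instruction]:
        MICRO_OPS[micro_op](regs)
    return regs


regs = {"A": 3, "B": 4, "TMP": 0}
execute("ADD_B", regs)

# A new instruction needs only a new control store entry, no new hardware.
CONTROL_STORE["DOUBLE_A"] = ["TMP<-A", "A<-A+TMP"]
execute("DOUBLE_A", regs)
print(regs)
```

Keeping the control store as data is what makes emulation possible in this model: a different instruction set is a different table over the same micro-operations.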

Some of the arithmetic micro-operations are listed below:

(A) ← (A) + (B): contents of A plus B are transferred to A.

A ← A': contents of register A are 1's complemented.

(B) ← (B) + 1: B register contents are incremented by 1.

Some instructions are capable of operating on processor flags:

CMC: complement carry.

STC: set carry flag.
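These micro-operations can be tried out on a small register model. The sketch assumes 8 bit registers and a single carry flag named CY; the function names are illustrative.

```python
MASK = 0xFF  # 8 bit register width assumed


def add_a_b(r):
    """(A) <- (A) + (B), with the carry out of bit 7 kept in CY."""
    total = r["A"] + r["B"]
    r["CY"] = total > MASK
    r["A"] = total & MASK


def complement_a(r):
    """A <- A', the 1's complement of A."""
    r["A"] = ~r["A"] & MASK


def increment_b(r):
    """(B) <- (B) + 1, wrapping at 8 bits; CY is left unchanged here."""
    r["B"] = (r["B"] + 1) & MASK


def cmc(r):
    """Complement the carry flag."""
    r["CY"] = not r["CY"]


def stc(r):
    """Set the carry flag."""
    r["CY"] = True


regs = {"A": 0xF0, "B": 0x20, "CY": False}
add_a_b(regs)        # 0xF0 + 0x20 overflows 8 bits
complement_a(regs)
increment_b(regs)
cmc(regs)
print(regs)
```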

Other types of instructions might include logical operations, stack operations, special input/output instructions, machine control operators and communication instructions embedding priority control (like SIM and RIM of the 8085). With 2 variables, 2^(2^2) = 16 logic microinstructions are possible. Some of the logic operations implemented by a multiplexer (MUX) are shown in Fig. 8. The MUX serves to select either an arithmetic or a logical instruction, and also for register select control. Similarly, a demultiplexer is used for output activity on a select basis with segmented memory configurations.
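The count of 16 functions of two variables, and the way a multiplexer realizes any one of them, can be checked with a short model. The sketch assumes a 4-to-1 MUX whose two select lines carry the variables x and y and whose four data inputs hold the truth table of the desired function; the names are illustrative.

```python
def mux4(data, s1, s0):
    """4-to-1 multiplexer: pass the data input chosen by select lines s1, s0."""
    return data[(s1 << 1) | s0]


def all_two_variable_functions():
    """Every possible truth table for two inputs, as 4-tuples of output bits."""
    return [tuple((code >> i) & 1 for i in range(4)) for code in range(16)]


# Truth tables indexed by (x, y) = 00, 01, 10, 11.
AND = (0, 0, 0, 1)
OR = (0, 1, 1, 1)
XOR = (0, 1, 1, 0)

for x in (0, 1):
    for y in (0, 1):
        print(x, y, mux4(AND, x, y), mux4(OR, x, y), mux4(XOR, x, y))

print(len(all_two_variable_functions()))
```

Changing the operation means changing only the four data inputs, which is why a MUX is a convenient building block for the logic section of an ALU.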

Fig. 8 Logic operations using MUX

The types and methods of operating procedures involve queue, stack and deque structures, besides the ever increasing capacities of process registers.

The stored machine language program is composed of opcodes and operands as basic building blocks, which may have to cater to different addressing mechanisms and varying concurrent events in constituting the instruction set. The user support to hardware interrupts on an operating system is a critical design factor for interactive on-line benches.

The microprogramming area must also provide a good amount of buffer storage for fault tolerance.

The microprogramming should preferably support an assembler environment with good error diagnostics for a symbolic program presented as input. The microprogrammer has to build up essential macro-calls and embedded subroutine linkages wherever possible, for better static software.

The multitasking multiuser machines are often interruptible at the command level and the same has to be catered well in data base management systems and distributed compute-bound processor systems. They have to take care of collision avoidance, deadlock prevention and system security measures.

Fixed design machines of the RISC type often employ hardwired control. In microprogramming, by contrast, a writable control memory offers the flexibility of choosing the instructions of a computer dynamically. However, most microprogrammed systems use a ROM for the control memory because it is cheaper and faster than a RAM, and also to prevent the occasional user from changing the system finiteness.




Updated: Thursday, March 9, 2017 22:38 PST