Data Representation

From ASM Book

Jump to: navigation, search

Webster defines a computer as a programmable electronic device that can store, retrieve, and process data. As a programmer, you will be working with data, whether it be databases or user interactions. This section will bestow upon you a profound understanding of what data really is, and allow you to view problems in a new and interesting way.

Contents

Data Units

Humans can view data in a number of ways, such as the time of day, a friends name, or a picture of a tree. A computer, on the other hand, has only three ways of interpreting data: either as a quantity that is processed, machine code that is executed, or having no meaning at all. In addition, the computer relies upon you, the programmer, to tell it how to interpret that data.

Bit

The smallest "unit" of data in a binary machine is the bit(1). A bit has only two states, 0 and 1. The binary nature of the bit is inherited from the underlying electronics. The digital circuits that run a computer only have two levels, high and low. The meanings of high and low are tied to the hardware in use, and will not be covered in this book(2).

If a single bit is used to represent a number, that number will only have two possible values. Usually, a single bit on its own in the 0 state will represent the number 0, and in the 1 state will represent the number 1.

Setting, clearing, and toggling bits

To set a bit means changing its value to 1. To clear a bit means changing its value to 0. To toggle a bit means inverting its value (if it is currently 1, then toggling it will change it to 0).

Grouping Bits

One bit has two possible states: 0 and 1. If you have two bits, Bit A and Bit B, the total amount of combinations you can form is 4--you have the two states of Bit A while Bit B is 0, plus the two states of Bit A while Bit B is 1. If you add another bit, Bit C, you have 8 possible states--you have the 4 combinations of Bits A and B while Bit C is 0, plus the 4 combinations of Bits A and B while Bit C is 1. If you add a Bit D, there are 16 combinations--the same reasoning applies. Mathematically describing the amount of combinations formable by a group of 4 bits,

  2*2*2*2

or

  2^1 * 2^1 * 2^1 * 2^1 = 2^4 = 16
A group of n bits will have a total of 2n possible states.

Usually, the state of a group of bits is used to represent an integer. (An integer is a number that has no fractional portion--that is, if you write out the number on paper, there will only be zeroes after the decimal point.) Each possible state of the group of bits will correspond to one unique integer. That is, given an integer, you can figure out what state the bits must be in to represent that value (if it is possible to represent that value with those bits). Also, if you know the state of the bits, you can figure out what integer is represented.

If you wanted to use the state of a group of n bits to remember the value of a nonnegative (that is, zero or positive) integer, without any gaps in the possible values, then the lowest number you can remember is 0, and the highest number you can remember is 2n-1. That's a total of 2n possible numbers. If you want to be able to remember a higher number, you need to use more bits.

For ease of communication, special terms are used for the most commonly used bit groupings.

Byte

The byte is the smallest unit of data that a microprocessor can manipulate. The 80x86 byte is a group of 8 bits clubbed together to represent a total of 28 = 256 bit states. If a byte is used to represent a nonnegative integer, the smallest number that it can represent is 0 (00h). The highest number it can represent is 28-1 = 255 (FFh). If you wanted to represent 256 or any higher number, you would need to use a larger-sized unit of data.

Octet

On most existing microprocessors, a byte is 8 bits. But the size of a byte depends on the microprocessor in question. That is why you are bound to come across the term "octet", which means "a group of 8 bits"--regardless of whatever the size of a byte may be.

Nibble (or Nybble)

There are two nibbles in a byte. One nibble is comprised of the first half of the bits in the byte, and the other nibble is comprised of the last half of the bits in the byte. On an 80x86, a nibble is 4 bits in size.

Word

'Word' is probably the most confusing term in data representation. There are actually two meanings. The original definition refers to the machine word, which is the maximum number bits that a processor can work with at a time. 32-bit microprocessors have machine words of 32 bits, whereas 16-bit microprocessors have words of 16 bits. The other definition of 'word', which is what 80x86 assembly community and this book uses, is a group of 16 bits.

Double-word

A group of 32 bits.

Wyde (Knuth)

D.Knuth (author of The Art of Computer Programming) has invented a new term for 16-bit data to get away from the ambiguity of the term 'word'. Because of the use of 'w' in functions handling 16-bit characters, he adopted a term that starts with 'w'. Also, 16-bit characters are currently known as "wide" characters in the C and C++ languages.

Tetra (Knuth)

Another D.Knuth term, which is short for tetrabyte. Four 8-bit bytes is 32 bits.

Octa (Knuth)

Another D.Knuth term, which is short for octabyte. Eight 8-bit bytes is 64 bits.

Use of groups of bits

What can you do with a word? For instance, you can represent the states of 16 lights; one bit per light. We can declare that if a bit is set, then its corresponding light is turned on. If the bit is cleared, then the light is off. And since each bit is independent of the others, some lights can be on while others are off.

A word can also represent a quantity. In the above example, you can find out how many lights are turned on by testing the number of bits that are set. If none of the lights are on, then zero bits are set. If 5 lights are on, then 1+1+1+1+1 = 5 bits are set. Any quantity, from 0 through 16, of 'on' lights can be expressed with the word; 17 distinct values.

Let's assume you only care about how many lights are on. Since you are using 65536 states to represent 17 values, the above use of a word is a waste. The most efficent way to represent a value is by assigning weights to each bit position. The binary nature of bits suggests that we should use the base-2 numbering system.

Representing integers as binary numbers

You're already familar with at least one numbering system: decimal. A decimal digit, the base-10 couterpart of the bit, can have a value from 0 through 9. What if you need to represent a number larger than 9? Add another digit. Fifteen can be represented as 78 = 7 + 8 = fifteen. Forty three might be 99997! (Kind of resembles Roman numerals.)

The problem with this system of simply adding the digits is that there are many redundant numbers. 69, 78, 87, 96 all represent the value 15. There is not one unique value represented by each number.

Each digit in a sequence of digits that form a number can be assigned a unique number to identify that digit, called a "place". The rightmost digit is said to be in place "0", and each other digit has a place that is one higher than its neighbor to the right. For example 
6502
||||
|||\-> Place 0
||\--> Place 1
|\---> Place 2
\----> Place 3

Next, we can assign place values to each place. This way, less digits are required to represent the same value. Let's assume that each digit has a place value that is 3 times greater than its neighbor to the right. For example, 11 would equal 1*(place value of digit in place 1) + 1*(place value of digit in place 0) = 1*3^1 + 1*3^0 = 3 + 1 = 4. To represent 43, we could write 421, which would equal 4*3^2 + 2*3^1 + 1*3^0 = thirty-six + six + one. We were able to use 3 digits instead of 5.

There is still room for improvement. It turns out that the best factor to increase place values by is the radix of the numbering system in use. The radix is the first value that cannot be represented by a digit. In decimal, the highest digit is 9, so the radix is 10. Conveniently, this also equals the amount of possible digits. By using the radix as the factor to calculate place values, each digit can be used to represent a multiple of the first value that cannot be represented by the digits to the right of it. In this way, there are no duplicate representations of any value.

  • The place value of a place is (r^p), and the value of a digit in a place is (r^p)d where r=radix, p=place, d=digit number.

Because of this equation, the radix of a numbering system is also called the base of the numbering system. For example, decimal is referred to as base 10, binary is referred to as base 2.

To illustrate the equation, we will calculate the value of 6502
6502
||||
|||\-> (10^0)2 = 1*2    = 2
||\--> (10^1)0 = 10*0   = 0
|\---> (10^2)5 = 100*5  = 500
\----> (10^3)6 = 1000*6 = 6000

6000+500+0+2 = 6502.
Now an example in which binary is translated to decimal (binary values end in "b" to show that they are binary)
1101b
||||
|||\-> (2^0)1b = 1*1 = 1
||\--> (2^1)0b = 2*0 = 0
|\---> (2^2)1b = 4*1 = 4
\----> (2^3)1b = 8*1 = 8

1+0+4+8=13.

Thus, 1101b is the way to write 13 in binary.

To convert a decimal number to binary, just keep dividing the number by 2-- the remainder equals the digit, and the quotient is the number to use to get the next digit to the left. For example, let's convert 6502 to binary
6502 / 2 = 3251 REM 0
3251 / 2 = 1625 REM 1
1625 / 2 =  812 REM 1
 812 / 2 =  406 REM 0
 406 / 2 =  203 REM 0
 203 / 2 =  101 REM 1
 101 / 2 =   50 REM 1
  50 / 2 =   25 REM 0
  25 / 2 =   12 REM 1
  12 / 2 =    6 REM 0
   6 / 2 =    3 REM 0
   3 / 2 =    1 REM 1
   1 / 2 =    0 REM 1

So, 6502 expressed as a binary number is 1100101100110b.

Binary numbers and groups of bits

Until now, two separate topics have been discussed: groups of bits, and binary numbers. Now we want to know how to use a group of bits to represent binary numbers. Each bit can represent one binary digit-- therefore, a binary number of n digits can be represented by a group of n or more bits.

Bits in a group of bits are assigned unique numbers to identify them, in exactly the same way as places in a number are assigned place numbers.

0000  <- a bunch of bits set to 0
||||
|||\-> Bit 0
||\--> Bit 1
|\---> Bit 2
\----> Bit 3

For any two bits A and B, where bit A has a higher bit ID number than bit B, (for example, if bit A is Bit 4 and bit B is Bit 2,) we say that bit A is a "higher" bit than bit B. We also say that bit A is "to the left" of bit B. When you write out a sequence of bits, you write the bits in descending order-- high bits to the left, low bits to the right.

If we want a group of bits to represent a binary number, we usually use bit n to hold the digit in place n of the binary number. So, bit 0 holds the value in place 0, bit 1 holds the value in place 1, etc... For example
1101  <- A bunch of bits representing the value 1101b
||||
|||\-> Bit 0, holding the digit in place 0, which is 1
||\--> Bit 1, holding the digit in place 1, which is 0
|\---> Bit 2, holding the digit in place 2, which is 1
\----> Bit 3, holding the digit in place 3, which is 1

Groups of groups of bits

It may be that you are working with a system that only supports 8-bit bytes. How could you store a number higher than 255?

You can combine bytes (or other groups of bits) in the same way as you combine digits. The radix of a byte is 256 (the first value that can't be represented by a byte). So, given a Byte 1 which stores the value 152 and a Byte 0 which stores the value 74, the combined value represented by these bytes can be considered to equal 152 256^1 + 74 256^0 = 152 256 + 74 = 38986. The maximum value you can store with 2 bytes is 255 256^1 + 255 256^0 = 255 256 + 255 = 65535, which is also the maximum value you can store in a group of 16 bits.

Representing negative numbers

The above method is only one possible way to represent a number with bits. While there are always 2n possible values represented by a group of n bits, you can assign numeric values to each combination of bit values as you like, so the range of values isn't always from 0 to 2n-1. For example, with 8 bits you can represent values in the set {0, 2, 4, 6, ... 508, 510}. You can also represent values in the range -128 to 127.

If a value cannot be negative, it is called an unsigned number. If a value can be negative or nonnegative, it is called a signed number. This is because you need to remember the negative sign associated with the number--that is, whether the value is negative or not.

The method of storing signed numbers on an 80x86 is called the "two's complement" method. For such a signed number, the highest bit is not considered to be a binary digit, but is considered to be the "sign" of the number. If the bit is 1, the number is negative--otherwise it is nonnegative (zero or positive).

When the number is signed, there is one sign bit, and the remaining bits represent an unsigned integer. When the sign bit is 0, the value is equal to the value of that unsigned integer. When the sign bit is 1, the value is equal to the value of the unsigned integer minus the radix of that unsigned integer.

For example, with an 8 bit signed number, the highest (leftmost) bit is the sign bit, and the remaining 7 bits are the unsigned integer. This unsigned integer is in the range 0 to 127. The radix of that unsigned integer is 128. So, when the number is negative, the value 128 is subtracted from the value of the unsigned integer to get the value of the entire byte. Thus, when the value is negative, the byte has the range 0-128 to 127-128, or -128 to -1, and when the value is positive, the byte has the range 0 to 127. The total range of possible values is -128 to 127.

If all bits are 1, (that is, the sign bit is 1 indicating that the value is negative and the unsigned integer is equal to 127,) the value will equal -1. This is true regardless of how large the group of bits is. Also, if the sign bit is 1 and all other bits are 0, then the value is equal to the lowest representable value.

Another way of describing two's complement signed integers is to say that the place value of the leftmost bit is negated. That is, instead of the place value of the leftmost bit of a byte equalling 128, it equals -128. So, if that bit is set, -128 is added to the value of the byte.

80x86 conventions

On the 80x86, it's presumed by the instructions that a group of n bits represents either any unsigned integer in the range 0 to 2n-1 or any signed integer in the range -2n-1 to 2n-1-1, using the methods described above. So, you will usually store integers in this way to make the most use of the instructions. The actual process of storing numbers in this way is automatically done by the assembler and the microprocessor.

Footnotes

1. The word Bit originates from the term "Binary digIT".

2. In traditional digital circuits, a high is represented by a +5 volt signal, and a low with ground (0V) on the line. The device that is driving the signal is either sourcing +5V or sinking to ground. In RS-232, a serial communications protocol, 1's are represented by -12V (-7 to -15), and 0's as +12V (+7 to +15).

(I've removed note 3. Of course computers have concept of "left and right". It's not literal left and right spatial positioning, but the words "left" and "right" are clearly defined to indicate the relationship of bits in groupings of bits, so it is acceptable to use those terms. For example, look at the instructions SHL and SHR. It's not like you can't tell what direction "left" is.)

Personal tools