GB 13000.1-1993 Information technology Universal multi-octet coded character set (UCS) Part 1: Architecture and basic multilingual plane

Basic Information

Standard ID: GB 13000.1-1993

Standard Name: Information technology Universal multi-octet coded character set (UCS) Part 1: Architecture and basic multilingual plane

Chinese Name: 信息技术 通用多八位编码字符集(UCS) 第一部分:体系结构与基本多文种平面

Standard category: National Standard (GB)

Status: In force

Date of release: 1993-12-24

Date of implementation: 1994-08-01

Standard classification number

Standard ICS number: Information technology, office machinery and equipment >> 35.040 Character sets and information coding

Standard Classification Number: Electronic Components and Information Technology >> Information Processing Technology >> L71 Coding, Character Set, Character Recognition

Associated standards

Adoption status: ≡ ISO/IEC 10646.1-93

Publication information

Publishing house: China Standards Press

Other information

Release date: 1993-12-30

Review date: 2004-10-14

Drafting unit: Standardization Institute of the Ministry of Electronics Industry, Computer and Microelectronics Development Research Center, North China Institute of Computing Technology

Focal point unit: Standardization Institute of the Ministry of Electronics Industry

Proposing unit: Ministry of Electronics Industry of the People's Republic of China

Publishing department: State Bureau of Technical Supervision

Competent authority: National Standardization Administration

Introduction to the standard:

This standard specifies the Universal Multiple-Octet Coded Character Set (UCS), which can be used for the representation, transmission, interchange, processing, storage, input and display of the written forms of the world's languages and of additional symbols.

Partial content of the standard:

National Standard of the People's Republic of China
Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane
GB 13000.1-93
ISO/IEC 10646.1-1993
This standard is equivalent to the international standard ISO/IEC 10646.1-1993, Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane.
1 Subject content and scope of application
GB 13000 specifies the Universal Multiple-Octet Coded Character Set (UCS), which can be used for the representation, transmission, interchange, processing, storage, input and display of the written forms of the world's languages and of additional symbols.
This part of GB 13000 specifies the overall architecture of the UCS, and:
a. defines the terms used in GB 13000;
b. describes the overall structure of this coded character set;
c. specifies the Basic Multilingual Plane (BMP) of the UCS, and defines a set of graphic characters for the scripts and written forms of languages used worldwide;
d. specifies the coded representations and names of the graphic characters of the BMP;
e. specifies the four-octet (32-bit) canonical form of the UCS: UCS-4;
f. specifies the two-octet (16-bit) BMP form of the UCS: UCS-2;
g. specifies the coded representations of control functions;
h. specifies the management of future additions to this coded character set.
The UCS is a coding system different from the one specified in GB 2311. The method for designating the UCS from GB 2311 is specified in 17.2.
2 Conformity
2.1 General
Whenever private use characters are used in the manner specified in this national standard, the conformity requirements below do not apply to those characters themselves.
2.2 Conformity of information exchange
A coded character data element within coded information for interchange conforms to this national standard if the following conditions are met:
a. the coded representations of all graphic characters in the coded character data element conform to Chapter 6 and Chapter 7, to an identified form selected from Chapter 14, and to an identified implementation level selected from Chapter 15;
b. all graphic characters represented in the coded character data element are taken from an identified subset (see Chapter 13);
c. the coded representations of all control functions in the coded character data element conform to Chapter 16.
A declaration of conformity must identify the form adopted, the implementation level adopted, and the subset adopted, given as a list of collections and/or characters.
Approved by the State Bureau of Technical Supervision on 1993-12-24; implemented on 1994-08-01.
2.3 Conformity of devices
A device conforms to this national standard if it meets the requirements of item a below and of either or both of items b and c.
Note: The term "device" (see 4.17) is defined as a component of information processing equipment that can transmit and/or receive coded information in the form of coded character data elements. A device may be an input/output device in the conventional sense, or a process such as an application program or gateway function.
A declaration of conformity must identify the document containing the description specified in item a below, and must identify the form adopted, the implementation level adopted, the subset adopted (given as a list of collections and/or characters), and the control functions adopted in accordance with Chapter 16.
a. Device description: A device that conforms to GB 13000 must be the subject of a description that identifies the means by which the user can supply characters to the device and/or recognize them when they are made available to the user, as specified in items b and c respectively.
b. Originating device: An originating device must allow its user to supply any character from the adopted subset, and must be able to transmit the coded representations of those characters within a coded character data element in accordance with the adopted form and implementation level.
c. Receiving device: A receiving device must be able to receive and interpret the coded representation of any character within a coded character data element in accordance with the adopted form and implementation level, and must make any corresponding characters in the adopted subset available to the user in a way that the user can recognize.
Any corresponding characters that are not in the adopted subset must be indicated to the user in some way, but the indication need not distinguish these characters from one another.
Notes:
1 The indication to the user may consist of presenting the same character for every character not in the adopted subset, or, where appropriate to the type of user, providing a distinctive audible or visible signal.
2 For receiving devices with retransmission capability, see Appendix H (informative).
3 Reference standards
The provisions contained in the following standards constitute provisions of this standard through reference in this standard. At the time of publication, the editions shown were valid. All standards are subject to revision, and parties using this standard should explore the possibility of using the latest editions of the following standards.
GB 2311-90 Information processing - Seven-bit and eight-bit coded character sets - Code extension techniques
GB 5261-85 Supplementary control functions for text and symbol forming equipment
4 Terminology
The following definitions apply to GB 13000.
4.1 Basic Multilingual Plane (BMP)
Plane 00 of group 00.
4.2 Block
A contiguous range of characters that share common characteristics (such as belonging to one script).
4.3 Canonical form
The form in which the characters of this coded character set are specified, using four octets to represent each character.
4.4 Coded character data element
An element of interchanged information that consists of a sequence of coded representations of characters, in accordance with one or more identified coded character set standards.
4.5 Cell
The position within a row at which an individual character can be allocated.
4.6 Character
A member of a set of elements used for the organization, control or representation of data.
4.7 Character boundary
Within a stream of octets, the boundary between the last octet of the coded representation of one character and the first octet of the coded representation of the next character.
4.8 Coded character
A character together with its coded representation.
4.9 Coded character set
A set of unambiguous rules that establishes a character set and the one-to-one relationship between the characters of that set and their coded representations.
4.10 Code table
A table showing the characters allocated to the octets in a code.
4.11 Combining character
A member of an identified subset of the coded character set of this national standard, intended to combine with the non-combining graphic character that precedes it, or with a sequence of combining characters preceded by a non-combining character (see 4.13).
Note: This part of GB 13000 defines several collections of characters that contain combining characters.
4.12 Compatibility character
A graphic character included as a coded character of GB 13000 primarily for compatibility with existing coded character sets.
4.13 Composite sequence
A sequence of graphic characters consisting of a non-combining character followed by one or more combining characters (see 4.11).
Notes:
1 The graphic symbol for a composite sequence is generally a combination of the graphic symbols of the individual characters in the sequence.
2 A composite sequence is not a character and is therefore not a member of the repertoire of GB 13000.
4.14 Control function
An action that affects the recording, processing, transmission or interpretation of data, and whose coded representation consists of one or more octets.
4.15 Default state
The state assumed when no state is explicitly specified.
4.16 Detailed code table
A code table showing the individual characters, usually covering part of a row.
4.17 Device
A component of information processing equipment capable of sending and/or receiving coded information within coded character data elements (it may be an input/output device in the conventional sense, or a process such as an application program or gateway function).
4.18 Graphic character
A character, other than a control function, that has a visual representation, usually handwritten, printed or displayed.
4.19 Graphic symbol
The visual representation of a graphic character or of a composite sequence.
4.20 Group
A subdivision of the coding space of this coded character set, containing 256 × 256 × 256 cells.
4.21 Interchange
The transfer of character coded data from one user to another by means of telecommunication or interchangeable media.
4.22 Interworking
A process that allows two or more systems using different coded character sets to interchange character coded data accurately; it may involve conversion between the two codes.
4.23 Octet
An ordered sequence of eight bits considered as a unit.
4.24 Plane
A subdivision of a group, containing 256 × 256 cells.
4.25 Presentation; to present
The process of writing, printing or displaying a graphic symbol.
4.26 Presentation form
In the presentation of some scripts, a form of the graphic symbol representing a character that depends on the position of the character relative to other characters.
4.27 Private use planes
Planes whose contents are not specified in GB 13000 (see 10.1).
4.28 Repertoire
A specified set of characters that are represented in a coded character set.
4.29 Row
A subdivision of a plane, containing 256 cells.
4.30 Script
A set of graphic characters used for the written form of one or more languages.
4.31 Supplementary planes
Planes that accommodate characters not allocated to the Basic Multilingual Plane.
4.32 User
A person or other entity that uses the services provided by a device (for example, if the device is a transcoder or gateway function, the user entity may be a process such as an application program).
4.33 Zone
A sequence of cells in a code table, consisting of one or more rows (whole or partial) that contain characters of a particular class (see Chapter 8).
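As a minimal sketch of definitions 4.11 and 4.13, the Python snippet below checks whether a string is a composite sequence, that is, one non-combining character followed by one or more combining characters. It rests on an assumption: Python's `unicodedata.combining()` (a nonzero canonical combining class) is used as a stand-in for the combining characters identified by GB 13000, which this excerpt does not list. The helper names `is_combining` and `is_composite_sequence` are hypothetical.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    # Assumption: a nonzero canonical combining class approximates
    # "combining character" as used in 4.11.
    return unicodedata.combining(ch) != 0


def is_composite_sequence(seq: str) -> bool:
    # 4.13: a non-combining character followed by one or more combining characters.
    if len(seq) < 2:
        return False
    base, marks = seq[0], seq[1:]
    return not is_combining(base) and all(is_combining(m) for m in marks)


print(is_composite_sequence("e\u0301"))   # base letter + COMBINING ACUTE ACCENT
print(is_composite_sequence("ab"))        # two non-combining characters
print(is_composite_sequence("\u0301"))    # a lone combining character
```

Consistent with note 2 of 4.13, the function treats the sequence as several coded characters; it never maps the sequence to a single character.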
5 UCS overall structure
This chapter describes the overall structure of the Universal Multiple-Octet Coded Character Set (hereinafter referred to as "this coded character set"), which is illustrated in Figures 1 and 2. The normative specification of the structure is given in the following chapters.
In GB 13000, the value of any octet is expressed in hexadecimal notation from 00 to FF. See Appendix J (informative).
The canonical form of this coded character set, that is, the way in which it is to be conceived, uses a four-dimensional coding space that is regarded as a single entity and consists of 128 three-dimensional groups.
Note: Bit 8 of the most significant octet in the canonical form of a coded character can therefore be used for internal processing within a device, as long as it is set to zero within a conforming coded character data element.
Each group contains 256 two-dimensional planes. Each plane contains 256 one-dimensional rows, and each row contains 256 cells. A character is allocated to a cell in this coding space; otherwise the cell is declared unused.
In the canonical form, four octets are used to represent each character; they specify the group, plane, row and cell respectively. The canonical form consists of four octets because two octets are not enough to cover all the characters in the world, and because a 32-bit representation matches the architecture of modern processing systems.
The four-octet canonical form can be used as a four-octet coded character set, in which case it is called UCS-4.
The first plane (plane 00 of group 00) is called the Basic Multilingual Plane (BMP). It includes the characters in general use in alphabetic, syllabic and ideographic scripts, together with various symbols and digits. The BMP also has a restricted use (RU) zone, in which the characters have special properties.
The subsequent planes are regarded as supplementary planes or private use planes, and are used to accommodate additional graphic characters. The 32 planes with plane octet values E0 to FF in group 00 are private use planes. The 32 groups with group octet values 60 to 7F in this coded character set are private use groups. GB 13000 does not specify the contents of the cells in the private use zones.
Each character is located in this coded character set by its group octet, plane octet, row octet and cell octet.
In addition to the canonical form, this national standard also specifies a two-octet BMP form. The Basic Multilingual Plane can therefore be used as a two-octet coded character set, in which case it is called UCS-2.
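The dimensions and private use ranges described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `total_cells` and `classify` and the returned labels are hypothetical, and the classification uses nothing beyond the ranges stated in this chapter (group octets 60 to 7F, and plane octets E0 to FF in group 00).

```python
GROUPS = 128            # group octets 00 to 7F
PLANES_PER_GROUP = 256
ROWS_PER_PLANE = 256
CELLS_PER_ROW = 256


def total_cells() -> int:
    # Size of the four-dimensional coding space.
    return GROUPS * PLANES_PER_GROUP * ROWS_PER_PLANE * CELLS_PER_ROW


def classify(group: int, plane: int) -> str:
    # Hypothetical labels for the kinds of plane named in this chapter.
    if not (0x00 <= group <= 0x7F and 0x00 <= plane <= 0xFF):
        raise ValueError("group must be 00..7F and plane must be 00..FF")
    if 0x60 <= group <= 0x7F:
        return "private use group"
    if group == 0x00 and 0xE0 <= plane <= 0xFF:
        return "private use plane"
    if group == 0x00 and plane == 0x00:
        return "Basic Multilingual Plane"
    return "supplementary plane"


print(total_cells() == 2 ** 31)   # 128 groups fit in 31 bits
print(classify(0x00, 0x00))
print(classify(0x00, 0xE0))
print(classify(0x7F, 0x10))
```

Because there are 128 groups rather than 256, every cell fits in 31 bits, which is why bit 8 of the most significant octet is always zero in a conforming coded character data element.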
Subsets of the coding space can be used to give a sub-repertoire of graphic characters.
Appendix G (informative) specifies a UCS transformation format (UTF-1), which can be used to transmit text data through communication systems that are sensitive to the octet values of control characters coded according to the structure of GB 2311.
6 Basic structure and terminology
6.1 Structure
The Universal Multiple-Octet Coded Character Set specified in GB 13000 shall be regarded as a single entity.
The entire coded character set shall be conceived as comprising 128 groups, each of 256 planes. Each plane shall be regarded as containing 256 rows, and each row as containing 256 cells.
In a code table representing the contents of a plane (such as Figure 2), the horizontal axis shall represent the least significant octet, with the smallest value at the left; the vertical axis shall represent the more significant octet, with the smallest value at the top.
Each axis of the coding space shall be coded by one octet. Within each octet, the most significant bit shall be bit 8 and the least significant bit shall be bit 1.
Accordingly, the weight of each bit shall be:
Bit 8: 128
Bit 7: 64
Bit 6: 32
Bit 5: 16
Bit 4: 8
Bit 3: 4
Bit 2: 2
Bit 1: 1
6.2 Character encoding
In the canonical form of the coded character set, each character in the entire coded character set shall be represented by a sequence of four octets. The group octet shall be the most significant octet and the cell octet shall be the least significant octet. This sequence can therefore be expressed as:
m.s.  Group-octet  Plane-octet  Row-octet  Cell-octet  l.s.
where m.s. means the most significant octet and l.s. means the least significant octet.
For brevity, the octets can be written as:
G-octet
P-octet
R-octet
C-octet
Where appropriate, these can be further abbreviated as G, P, R and C.
The value of any octet shall be represented by two hexadecimal digits, for example: 31 or FE. When a single character is to be identified by the values of its group, plane, row and cell, it shall be expressed in the following form:
0000 0030 represents DIGIT ZERO (digit 0)
0000 0041 represents LATIN CAPITAL LETTER A (Latin capital letter A)
When referring to a character in plane 00 of group 00, the first four zeros (representing the group octet and the plane octet) can be omitted. For example:
0030 represents DIGIT ZERO (digit 0)
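The octet layout and notation of 6.1 and 6.2 can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not an implementation taken from the standard: the helper names `octet_value`, `to_gprc`, `notation` and `ucs4_octets` are hypothetical, and the canonical form is taken to be emitted most significant octet first, following the m.s./l.s. ordering given above.

```python
def octet_value(bits_set) -> int:
    # 6.1: bit 1 has weight 1 and bit 8 has weight 128.
    return sum(1 << (bit - 1) for bit in bits_set)


def to_gprc(code: int):
    # Split a canonical-form value into (G, P, R, C), most significant first.
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("bit 8 of the G-octet must be zero (groups 00..7F)")
    return ((code >> 24) & 0xFF, (code >> 16) & 0xFF,
            (code >> 8) & 0xFF, code & 0xFF)


def notation(code: int, short: bool = False) -> str:
    # 6.2: two hexadecimal digits per octet, e.g. "0000 0041".
    g, p, r, c = to_gprc(code)
    if short:
        # The leading four digits may be dropped only in plane 00 of group 00.
        if (g, p) != (0, 0):
            raise ValueError("short form applies only to plane 00 of group 00")
        return f"{r:02X}{c:02X}"
    return f"{g:02X}{p:02X} {r:02X}{c:02X}"


def ucs4_octets(code: int) -> bytes:
    # Four-octet canonical form: G, P, R, C in that order.
    return bytes(to_gprc(code))


print(octet_value({8, 1}))            # 128 + 1
print(notation(0x30))                 # DIGIT ZERO
print(notation(0x41, short=True))     # LATIN CAPITAL LETTER A
print(ucs4_octets(0x41).hex())
```

Keeping the split in one place (`to_gprc`) means the notation and the octet sequence cannot disagree about which octet is the group, plane, row or cell.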
[Figure 1: the entire coding space, with groups up to Group 7F; each plane contains 256 × 256 character positions]
26 PRCscniation form In some literary representations, a certain graphical symbol form for a character, which depends on the position of the character relative to other characters.
4.27 Private use planes A plane whose contents are not specified in GB13000 (see 10.1). 4.28 Vocabulary
A specified set of characters represented by a terminal code frame, 4.29 Row
A unit of division of half a character, which has 256 glyphs. 4.30 Script seript
A set of morphological symbols used in a book of one or more languages. 4.31 Supplementary planes A plane containing those symbols not arranged in the basic multilingual plane. 4.32 User: An individual or other entity that enjoys the services provided by a device. (For example, if the device is a transcoder or gateway function, the user entity may be an application program such as an application program.) 4.33: A sequence of characters in a code table. It consists of one or more lines (whole lines or partial lines) containing characters of a specific class (see Chapter 8).
5 Overall structure of the UCS
This chapter describes the overall structure of the universal multi-octet coded character set (hereinafter referred to as "this coded character set") and illustrates it in Figures 1 and 2. The normative specification of this structure is given in the following chapters. In GB 13000, the value of any octet is expressed in hexadecimal notation from 00 to FF (see Appendix J, informative).
The canonical form of this coded character set, and the way in which it is to be conceived, uses a four-dimensional coding space that is regarded as a single entity and consists of 128 three-dimensional groups.
Note: Consequently, bit 8 of the most significant octet in the canonical form of a coded character can be used for internal processing within a device, provided that it is set to zero within a conforming coded-character-data-element.
Each group contains 256 two-dimensional planes. Each plane contains 256 one-dimensional rows, and each row contains 256 cells. A character is allocated and coded at a cell in this coding space; otherwise the cell is declared unused. In the canonical form, four octets are used to represent each character, and they specify the group, plane, row and cell respectively. The canonical form consists of four octets because two octets are not enough to accommodate all the characters of the world, and a 32-bit representation is in line with the architecture of modern processing systems.
The four-octet canonical form can be used as a four-octet coded character set, in which case it is called UCS-4. The first plane (plane 00 of group 00) is called the basic multilingual plane (BMP). It includes characters in general use in alphabetic, syllabic and ideographic scripts, together with various symbols and digits. The BMP also has a restricted use (RU) zone, in which the characters have special properties. The subsequent planes are regarded as supplementary planes or private use planes and accommodate additional graphic characters. The 32 planes with plane-octet values E0 to FF in group 00 are private use planes. The 32 groups with group-octet values 60 to 7F in this coded character set are private use groups. GB 13000 does not specify the contents of the cells in the private use zones. Each character is located in this coded character set by its group-octet, plane-octet, row-octet and cell-octet. In addition to the canonical form, this national standard also specifies a two-octet BMP form. The basic multilingual plane can therefore be used as a two-octet coded character set, identified as UCS-2.
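The group, plane, row and cell structure described above can be illustrated with a minimal Python sketch. The names `CellAddress`, `split_ucs4`, `is_bmp`, `is_private_use_plane` and `is_private_use_group` are hypothetical helpers introduced here for illustration only; they are not defined by GB 13000. The sketch assumes that a canonical-form value is held as a single 32-bit integer with the group-octet in the most significant position.

```python
from typing import NamedTuple


class CellAddress(NamedTuple):
    """Hypothetical container for the four octets of the canonical form."""
    group: int
    plane: int
    row: int
    cell: int


def split_ucs4(value: int) -> CellAddress:
    """Split a 32-bit canonical-form value into group, plane, row and cell."""
    # Bit 8 of the most significant octet must be zero, so only the
    # 128 groups 00..7F are part of the coding space.
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value lies outside the 128-group coding space")
    return CellAddress(
        group=(value >> 24) & 0xFF,
        plane=(value >> 16) & 0xFF,
        row=(value >> 8) & 0xFF,
        cell=value & 0xFF,
    )


def is_bmp(addr: CellAddress) -> bool:
    """True for plane 00 of group 00, which also fits the two-octet form."""
    return addr.group == 0x00 and addr.plane == 0x00


def is_private_use_plane(addr: CellAddress) -> bool:
    """True for planes E0..FF of group 00."""
    return addr.group == 0x00 and 0xE0 <= addr.plane <= 0xFF


def is_private_use_group(addr: CellAddress) -> bool:
    """True for groups 60..7F."""
    return 0x60 <= addr.group <= 0x7F
```

For example, `split_ucs4(0x00000041)` yields group 00, plane 00, row 00, cell 41, which lies in the BMP and can therefore also be written in two octets as row 00, cell 41.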
Subsets of the coding space may be used in order to give a sub-repertoire of graphic characters. Appendix G (informative) specifies a UCS transformation format (UTF-1), which can be used to transmit text data through communication systems that are sensitive to the octet values of control characters coded according to the structure of GB 2311.
6 Basic structure and terminology
6.1 Structure
The universal multi-octet coded character set specified in GB 13000 shall be regarded as a single entity. The entire coded character set shall be conceived as comprising 128 groups, each of which has 256 planes. Each plane shall be regarded as containing 256 rows, and each row as containing 256 cells. In a code table representing the contents of a plane (such as Figure 2), the horizontal axis shall represent the least significant octet, with its smallest value at the left; the vertical axis shall represent the more significant octet, with its smallest value at the top. Each axis of the coding space shall be coded by one octet. Within each octet, the most significant bit shall be bit 8 and the least significant bit shall be bit 1.
Accordingly, the weight of each bit shall be:
Bit 8: 128
Bit 7: 64
Bit 6: 32
Bit 5: 16
Bit 4: 8
Bit 3: 4
Bit 2: 2
Bit 1: 1
6.2 Character encoding
In the canonical form of the coded character set, each character in the entire coded character set shall be represented by a sequence of four octets. The most significant octet of this sequence shall be the group-octet, and the least significant octet shall be the cell-octet. Thus, this sequence can be expressed as:
m.s. Group-octet, Plane-octet, Row-octet, Cell-octet l.s.
where m.s. means the most significant octet and l.s. means the least significant octet. For brevity, the octets may be written as:
G-octet
P-octet
R-octet
C-octet
Where appropriate, these may be further abbreviated as G, P, R and C. The value of any octet shall be represented by two hexadecimal digits, for example 31 or FE. When a single character is to be identified by the values of its group, plane, row and cell, it shall be expressed in the following form:
0000 0030 represents DIGIT ZERO (digit 0)
0000 0041 represents LATIN CAPITAL LETTER A (Latin capital letter A)
When referring to a character within an identified plane, the leading four digits (representing the group-octet and the plane-octet) may be omitted. For example, 0030 represents DIGIT ZERO.
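The notation above can be sketched in a few lines of Python. The function names `to_octets` and `notation` are hypothetical and chosen only for this illustration. The sketch also assumes that the short four-digit form is permitted only when both the group-octet and the plane-octet are 00, which is a narrower reading than the text requires.

```python
def to_octets(group: int, plane: int, row: int, cell: int) -> bytes:
    """Build the four-octet canonical form, most significant octet first."""
    if not 0x00 <= group <= 0x7F:
        raise ValueError("group-octet must be in the range 00..7F")
    for name, value in (("plane", plane), ("row", row), ("cell", cell)):
        if not 0x00 <= value <= 0xFF:
            raise ValueError(f"{name}-octet must be in the range 00..FF")
    return bytes([group, plane, row, cell])


def notation(octets: bytes, short: bool = False) -> str:
    """Render four octets as two hexadecimal digits each, e.g. '0000 0030'."""
    if len(octets) != 4:
        raise ValueError("the canonical form has exactly four octets")
    digits = octets.hex().upper()
    if short:
        # Assumption of this sketch: omit the leading four digits only
        # when the group-octet and the plane-octet are both 00.
        if octets[:2] != b"\x00\x00":
            raise ValueError("short form requires group 00 and plane 00")
        return digits[4:]
    return f"{digits[:4]} {digits[4:]}"


print(notation(to_octets(0x00, 0x00, 0x00, 0x30)))              # 0000 0030
print(notation(to_octets(0x00, 0x00, 0x00, 0x41)))              # 0000 0041
print(notation(to_octets(0x00, 0x00, 0x00, 0x30), short=True))  # 0030
```

Keeping the octets in a `bytes` object with the group-octet first mirrors the m.s. to l.s. ordering of the canonical form, so the hexadecimal rendering needs no byte swapping.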
[Figure not reproduced. Surviving labels: "Each plane: 256 × 256 character positions"; "Group 7F".]