Basic principles and methods for information classifying and coding
ICS01.120
National Standard of the People's Republic of China
GB/T7027—2002
Replaces GB/T7027—1986
Basic principles and methods for information classifying and coding
Issued 2002-07-18
Implemented 2002-12-01
Issued by the General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China
2 Normative reference documents
3 Terms and definitions
4 Classification and coding of information
4.1 Information classification
4.2 Information coding
5 Basic principles of information classification
5.1 Scientificity
5.2 Systematicity
5.3 Extensibility
5.4 Compatibility
5.5 Comprehensive practicality
6 Basic methods of information classification
6.2 Line classification method
6.3 Facet classification method
6.4 Hybrid classification method
7 Basic principles of information coding
7.1 Uniqueness
7.2 Reasonableness
7.3 Extensibility
7.4 Simplicity
7.5 Applicability
7.6 Normativity
8 Basic methods of information coding
8.2 Code types
8.3 Code characteristics
8.4 Code representation
8.5 Code design
8.6 Code assignment convention
Appendix A (Informative Appendix) Advantages and disadvantages of various information classification and coding methods
A.1 Advantages and disadvantages of information classification methods
A.2 Advantages and disadvantages of various types of code coding methods
This standard is a revision of GB/T 7027-1986 "Guidelines for Standardization Work - Basic Principles and Methods for Information Classification and Coding". For information coding, this standard refers to the international technical report ISO/IEC TR 9789:1994 (E) "Information technology - Guidelines for the organization and representation of data elements for data interchange - Coding methods and principles" and adopts its more mature technical content.
This standard replaces GB/T 7027-1986.
Compared with GB/T 7027-1986, the main changes made in this revision are:
- The name of the standard has been changed to "Basic Principles and Methods for Information Classification and Coding".
- The overall arrangement and structure of the standard have been modified according to GB/T 1.1-2000. The table of contents, foreword, introduction and Appendix A have been added, and content of the original standard has been added and deleted accordingly. The added content comprises Chapter 2 "Normative Reference Documents", Chapter 3 "Terms and Definitions" and the overview in Chapter 4 "Information Classification and Coding". The deleted content is the algorithms related to Article 2.4 "Code Verification" of the original standard.
- The structure of the original standard has been adjusted: Article 1.1 "Basic Principles of Information Classification" has become Chapter 5; Article 1.2 "Basic Methods of Information Classification" has become Chapter 6; Article 2.2 "Basic Principles of Coding" has become Chapter 7; Article 2.3 "Types of Codes" and Article 2.5 "Types of Codes" of the original standard, together with the relevant technical content of ISO/IEC TR 9789, have been consolidated into Chapter 8 "Basic Methods of Information Coding"; and the advantages and disadvantages of the various information classification and coding methods described in the original standard have been summarized and moved to "Appendix A Advantages and Disadvantages of Various Information Classification and Coding Methods".
- Several code names in the original standard have been adjusted: the "characteristic combination code" of the original standard corresponds to the "juxtaposition code" in this standard, the "composite code" of the original standard corresponds to the "combination code" in this standard, and the "numerical alphabetical sequence code" of the original standard is covered by the "agreed sequence code" in this standard.
In the field of information classification and coding standardization, this standard should be used in conjunction with GB/T 20001.3-2001 "Standard Writing Rules Part 3: Information Classification Coding" and GB/T 10113 "General Terms for Classification and Coding".
Appendix A of this standard is an informative appendix.
This standard was proposed by, and is under the jurisdiction of, the China Standards Research Center.
Main drafting unit of this standard: China Standards Research Center.
Main drafters of this standard: Li Xiaolin, Feng Wei, Hu Jiazhang.
GB/T 7027 was first published in November 1986. This is its first revision.
In general, information is understood as the facts and related statements about all meaningful concrete or abstract things or concepts, expressed through data, messages and their further details. In the field of information classification and coding, information takes the form of data. Objective and unambiguous information is a prerequisite for building information systems on computers and exchanging data within them. In information systems, data is represented by characters (usually numbers or letters), arithmetic symbols and descriptions. These representations should have a clear and stable meaning for each datum involved, so that the data can be processed and communicated. If information is to be shared by different user groups or application systems, it must have an agreed definition covering, for example, the semantic meaning (connotation) of the concept, all instances (extension) of the concept, and an agreed representation. The correct understanding of information concepts depends on information classification; the agreed representation of information depends on information coding.
Basic principles and methods of information classification and coding
1 Scope
This standard specifies the basic principles and methods of information classification and coding. It is applicable to the preparation of various information classification and coding standards.
2 Normative references
The clauses in the following documents become clauses of this standard through reference in this standard. For any dated referenced document, subsequent amendments (excluding errata) or revisions do not apply to this standard. However, parties who reach an agreement based on this standard are encouraged to study whether the latest versions of these documents can be used. For any undated referenced document, the latest version applies to this standard.
GB/T 1988-1998 Information technology - Seven-bit coded character set for information exchange (eqv ISO/IEC 646:1991)
GB 2312-1980 Basic set of Chinese coded character set for information exchange
GB/T 2260-2002 Administrative division code of the People's Republic of China
GB/T 2659-2000 Code for names of countries and regions in the world (eqv ISO 3166-1:1997)
GB/T 4657-2002 Code for central party and government organs, people's organizations and other institutions
GB/T 7408-1994 Data elements and exchange formats - Date and time representation for information exchange (eqv ISO 8601:1988)
GB/T 10113 General terminology for classification and coding
GB 11643-1999 Citizen identity number
GB/T 13745-1992 Discipline classification and code
GB/T 14721.1-1993 Forestry resource classification and code - Forest type
GB/T 14805-1993 Application-level syntax rules for electronic data exchange for administration, commerce and transportation (idt ISO 9735:1988)
GB/T 17710-1999 Data processing - Check code system (idt ISO 7064:1983)
3 Terms and definitions
The terms established in GB/T 10113 apply to this standard.
4 Classification and coding of information
4.1 Information classification
Information classification is to distinguish and classify information according to certain principles and methods based on the attributes or characteristics of the information content, and establish a certain classification system and arrangement order.
Information classification has two elements: the classification object and the basis for classification. The classification object consists of several entities to be classified. The basis for classification depends on the attributes or characteristics of the classification object. Similarities or differences in the attributes of the information content form the various classes. In an information classification system, classes can be called categories.
4.2 Information Coding
Information coding assigns to things or concepts (coding objects) symbols that follow certain rules and are easy for computers and people to recognize and process, forming a set of code elements. The code elements in the set are the symbols assigned to the coding objects, that is, the code values of the coding objects.
All types of information can be coded, for example information about products, people, countries, currencies, programs, files, components, etc.
Information coding includes: the method of expressing data as codes, the code representation form of the data, and the assignment of the set of code elements.
The main functions of information coding are identification, classification, and reference. The purpose of identification is to distinguish the coding objects from one another; within the set of coding objects, the code value of a coding object is its unique mark. The classification function of information coding is essentially the identification of a class. The reference function of information coding means that the code value of a coding object can be used as a key for association between different application systems or application fields.
5 Basic principles of information classification
5.1 Scientificity
It is advisable to select the most stable essential attributes or characteristics of things or concepts (i.e., classification objects) as the basis for classification.
5.2 Systematicity
The selected attributes or characteristics of things and concepts should be systematized in a certain order to form a scientific and reasonable classification system.
5.3 Extensibility
Accommodation categories usually need to be set up so that the established classification system is not disrupted when new things or concepts are added. The classification system should also provide support for lower-level information management systems that build on it.
5.4 Compatibility
The classification should be compatible with relevant standards (including international standards).
5.5 Comprehensive practicality
Classification should meet practical needs on the premise of satisfying the overall task and general requirements of the system.
6 Basic methods of information classification
There are three basic methods of information classification: the line classification method, the facet classification method, and the hybrid classification method. The line classification method is also called the hierarchical classification method, and the facet classification method is also called the combination classification method.
6.2 Line classification method
6.2.1 Method
The line classification method divides the classification objects (i.e. the things or concepts to be classified) successively, according to the selected attributes or characteristics, into categories at several levels, arranged into a hierarchical, gradually unfolding classification system. In this classification system, a category that is divided is called a superordinate category, the categories divided from it are called subordinate categories, and the categories directly divided from the same category are called coordinate categories of one another. Coordinate categories stand in a parallel relationship to each other; a subordinate category belongs to its superordinate category.
6.2.2 Example
GB/T 14721.1-1993 "Forestry resource classification and code - Forest type" adopts the line classification method and uses a five-digit code. The standard divides forest types level by level, with code segments representing the forest vegetation type, the forest type group, and the forest type (the fourth and fifth digits). Some of the categories are shown in Table 1.
Table 1
Type name
Economic forest
    Beverage forest
        Tea forest
        Coffee forest
        Cocoa forest
    Fresh fruit forest
        Apple forest
        Pear forest
        Peach forest
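As an illustration only, the hierarchy of Table 1 can be written as a small parent map from which the superordinate, subordinate and coordinate categories of any category are derived. The dictionary layout and the function names below are assumptions made for this sketch; they are not part of GB/T 14721.1 or of this standard.

```python
# Hypothetical in-memory form of the Table 1 hierarchy: child -> parent.
# The root category has no parent (None).
PARENT = {
    "Economic forest": None,
    "Beverage forest": "Economic forest",
    "Fresh fruit forest": "Economic forest",
    "Tea forest": "Beverage forest",
    "Coffee forest": "Beverage forest",
    "Cocoa forest": "Beverage forest",
    "Apple forest": "Fresh fruit forest",
    "Pear forest": "Fresh fruit forest",
    "Peach forest": "Fresh fruit forest",
}


def superordinate(category):
    """Return the category from which `category` was directly divided."""
    return PARENT[category]


def subordinates(category):
    """Return the categories directly divided from `category`."""
    return [c for c, p in PARENT.items() if p == category]


def coordinates(category):
    """Return the other categories divided from the same superordinate category."""
    parent = PARENT[category]
    if parent is None:
        return []
    return [c for c, p in PARENT.items() if p == parent and c != category]


def level(category):
    """Return the depth of `category`, counting the root as level 1."""
    depth = 1
    while PARENT[category] is not None:
        category = PARENT[category]
        depth += 1
    return depth
```

Because every category has exactly one parent in this map, a category can never belong to two superordinate categories, which is one property the line classification method relies on.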
In Table 1, economic forest is a superordinate category relative to beverage forest and fresh fruit forest; beverage forest and fresh fruit forest are subordinate categories relative to economic forest, and are coordinate categories of each other. Similarly, beverage forest is a superordinate category relative to tea forest, coffee forest and cocoa forest; tea forest, coffee forest and cocoa forest are subordinate categories of beverage forest, and are coordinate categories of one another.
6.2.3 Requirements
a) The total scope of the subordinate categories divided from a superordinate category should be equal to the scope of the superordinate category;
b) When a superordinate category is divided into several subordinate categories, the same basis of division should be used;
c) Coordinate categories do not overlap or repeat, and each corresponds to only one superordinate category;
d) Classification should be carried out level by level in sequence, with no empty levels and no added levels.
6.3 Facet classification method
6.3.1 Method
The facet classification method regards several attributes or characteristics of the selected classification object as several "facets", and each "facet" can be divided into several mutually independent categories. In use, the categories in these "facets" can be combined as needed to form a composite category.
6.3.2 Example
Clothing can be classified using the facet classification method. Select clothing materials, men's and women's styles, and clothing styles as three "facets"; each "facet" can be divided into several categories, see Table 2.
Table 2
Clothing materials        Men's and women's styles   Clothing styles
Pure wool                 Men's                      Zhongshan suits (Mao suits)
Medium and long fibers    Women's                    Suits
                                                     One-piece dresses
In use, the relevant categories are grouped together, for example: pure wool men's Zhongshan suits, medium-long fiber women's suits.
6.3.3 Requirements
a) Select, as needed, the essential attributes or characteristics of the classification object as its various "facets";
b) The categories in different "facets" should not overlap with one another and cannot be repeated;
c) Each "facet" has a fixed position;
d) The selection of "facets" and the determination of their positions depend on actual needs.
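The combination step of the facet classification method can be sketched as follows. This is a minimal illustration, assuming the three clothing facets of the example with a few category names; the list `FACETS` and all function names are hypothetical and chosen only for this sketch.

```python
from itertools import product

# Facets in their fixed positions; each facet holds its own categories.
FACETS = [
    ("material", ["pure wool", "medium-long fiber"]),
    ("style", ["men's", "women's"]),
    ("garment", ["Zhongshan suit", "suit", "one-piece dress"]),
]


def check_facets(facets):
    """Raise ValueError if a category appears in more than one facet."""
    seen = set()
    for _, categories in facets:
        for category in categories:
            if category in seen:
                raise ValueError(f"category repeated across facets: {category}")
            seen.add(category)


def composite(*choices):
    """Combine one category per facet, in facet order, into a composite category."""
    if len(choices) != len(FACETS):
        raise ValueError("exactly one category per facet is required")
    for (name, categories), choice in zip(FACETS, choices):
        if choice not in categories:
            raise ValueError(f"{choice!r} is not a category of facet {name!r}")
    return " ".join(choices)


def all_composites():
    """Enumerate every composite category the facets can form."""
    return [" ".join(combo) for combo in product(*(c for _, c in FACETS))]


check_facets(FACETS)
```

For instance, `composite("pure wool", "men's", "Zhongshan suit")` yields the composite category of the example, and `all_composites()` lists all 2 x 2 x 3 combinations, including ones that may never occur in practice.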
6.4 Hybrid classification method
The hybrid classification method is an information classification method that combines the line classification method and the facet classification method: one of them is the main method, and the other is used as a supplement.
7 Basic principles of information coding
7.1 Uniqueness
In a classification coding standard, each coding object should have only one code, and one code represents only one coding object.
7.2 Reasonableness
The code structure should be compatible with the classification system.
7.3 Extensibility
The code should have appropriate spare capacity to meet the needs of continuous expansion.
7.4 Simplicity
The code structure should be as simple as possible and the code length as short as possible, to save machine storage space and reduce the error rate of the code.
7.5 Applicability
The code should reflect the characteristics of the coding object as much as possible, be applicable to different related application fields, and support system integration.
7.6 Normativity
In an information classification coding standard, the type of code, the structure of the code and the writing format of the code should be unified.
8 Basic methods of information coding
8.1 General
The coding method should be chosen, and an appropriate code structure selected, on the basis of the predetermined application requirements and the nature of the coding objects. In determining the code structure, it is necessary to consider the coding rules of the various codes and their advantages and disadvantages (see Appendix A), analyze the general characteristics of codes, select an appropriate code representation form, study the factors involved in code design, and avoid potential adverse consequences.
8.2 Code Types
Figure 1 shows the types of commonly used codes, divided according to whether the code is meaningful (see 8.3.3).
[Figure 1: code types. Codes divide into non-significant codes (sequential codes, comprising incremental, series and conventional sequence codes, and unordered codes) and meaningful codes.]
8.2.1 Sequential code
8.2.1.1 Rules
Characters are taken in sequence from an ordered character set and assigned to each coding object. These characters are usually natural numbers, for example starting with "1"; they can also be alphabetic characters, for example AAA, AAB, AAC, ...
8.2.1.2 Application
Sequential codes are generally used as independent codes for identification or reference purposes, or as part of a composite code; in the latter case a classification code is often attached.
In a numeric field with a fixed number of code positions, the value should be left-padded with zeros until the required number of positions is filled. Example: In a 3-digit numeric field, the number 1 is encoded as 001, and the number 15 is encoded as 015.
8.2.1.3 Type
There are three types of sequential codes: incremental sequential codes, series sequential codes, and conventional sequence codes.
8.2.1.3.1 Incremental Sequential Code
The code value assigned to the coded object can be determined by the increment of a predetermined number. For example, the predetermined number can be 1 (purely incremental), or 10 (only multiples of 10 can be assigned), or other numbers (such as 2 in the case of an even number), etc. In this way, the code value does not carry any meaning. The code values of similar coded objects are not grouped. In order to modify the original code set in the future, it may be necessary to use intermediate code values, and the assignment basis of these intermediate code values does not have to be incremented by 1.
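A minimal sketch of incremental assignment in a fixed-width numeric field is given below, assuming zero filling as in the 3-digit example above. The function name and its parameters (`step`, `width`, `start`) are illustrative assumptions, not terms defined by this standard.

```python
def incremental_codes(count, step=1, width=3, start=None):
    """Return `count` code values that grow by `step`, zero-filled to `width`.

    `start` defaults to `step`, so step=1 gives 001, 002, ... and
    step=10 gives 010, 020, ... (leaving gaps for later intermediate values).
    """
    if start is None:
        start = step
    codes = []
    for i in range(count):
        text = str(start + i * step).zfill(width)
        if len(text) > width:
            raise OverflowError(f"value {text} does not fit in {width} positions")
        codes.append(text)
    return codes
```

With a step larger than 1, the unused values between neighbours remain available as intermediate code values if the code set has to be supplemented later.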
Example: The numeric codes of some countries and regions in GB/T 2659-2000 "Codes for the Names of Countries and Regions in the World" (see Table 3).
Table 3
Country and region names   Numeric code
Afghanistan                004
Albania                    008
Algeria                    012
American Samoa             016
Andorra                    020
Angola                     024
In this standard, the later-added region name Antarctica uses the intermediate code value 010, which is a supplement to the original code set.
8.2.1.3.2 Series Sequential Code
For this type of code, the categories of the coding objects are determined first and a code value range is fixed for each category; code values are then assigned to the coding objects in sequence within the range of their category.
Example: GB/T 4657-2002 "Codes for Central Party and Government Organs, People's Organizations and Other Institutions" uses a three-digit series sequential code.
100~199 Indicates the National People's Congress, the National Committee of the Chinese People's Political Consultative Conference, the Supreme People's Procuratorate, and the Supreme People's Court.
200~299 Indicates the central government agencies and directly affiliated institutions.
300~399 Indicates the ministries and commissions of the State Council.
700~799 Indicates the national people's organizations and democratic parties.
Series sequential codes can only be used when the categories are stable and each specific coding object is unlikely, at present or in the foreseeable future, to belong to a different category.
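The range-then-sequence rule can be sketched as a small allocator. The ranges are taken from the example above; the shortened category labels, the class `SeriesAllocator` and the helper `category_of` are hypothetical names introduced only for this illustration.

```python
# Code value range per category (labels abbreviated for the sketch).
SERIES = {
    "congress, CPPCC, procuratorate and court": range(100, 200),
    "central agencies": range(200, 300),
    "State Council ministries": range(300, 400),
    "people's organizations and parties": range(700, 800),
}


class SeriesAllocator:
    """Assign code values in sequence within the range of each category."""

    def __init__(self, series):
        self._series = series
        self._next = {name: r.start for name, r in series.items()}

    def assign(self, category):
        value = self._next[category]
        if value not in self._series[category]:
            raise OverflowError(f"range of {category!r} is exhausted")
        self._next[category] = value + 1
        return f"{value:03d}"


def category_of(code):
    """Return the category whose range contains `code`, or None."""
    value = int(code)
    for name, r in SERIES.items():
        if value in r:
            return name
    return None
```

Because the category is recoverable from the code value alone, moving an object to another category would force a new code, which is why the method presupposes stable categories.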
8.2.1.3.3 Conventional Sequence Code
Conventional sequence code is not a pure sequence code. This code can only be used successfully under the condition that all coding objects are known in advance and the set of coding objects will not be expanded. Before assigning code values, the coding objects should be arranged according to certain characteristics, such as: alphabetical order of names, chronological order (of events, activities), etc. The order obtained in this way is then expressed by code values, and these code values themselves should also be selected in sequence from an ordered list.
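As a rough sketch of this arrange-then-number rule, the function below sorts a closed set of names alphabetically and assigns zero-filled sequence values. The function name, the two-digit width and the sample names are assumptions made for illustration.

```python
def conventional_sequence_codes(names, width=2):
    """Sort the names alphabetically, then assign 01, 02, ... in that order."""
    ordered = sorted(names, key=str.casefold)
    if len(set(ordered)) != len(ordered):
        raise ValueError("coding objects must be distinct")
    return {name: str(i).zfill(width) for i, name in enumerate(ordered, start=1)}


codes = conventional_sequence_codes(["Cherries", "Apples", "Dates", "Bananas"])
# Adding one more object changes the codes of everything sorted after it.
extended = conventional_sequence_codes(
    ["Cherries", "Apples", "Dates", "Bananas", "Blueberries"]
)
```

Comparing `codes` with `extended` shows why the set of coding objects must be known in advance: inserting a single name shifts the code values of all names that follow it in the ordering.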
Example: A numerical alphabetical sequence code, with the names arranged in alphabetical order (see Table 4).
Table 4
Apples
Bananas
Cherries
Dates
8.2.2 Unordered code
8.2.2.1 Rules
An unordered code assigns unordered natural numbers or letters to the coding objects. Such codes follow no compilation rule and are generated by a machine's random program.
8.2.2.2 Application
An unordered code can be used on its own to identify the coding object, and can also be used as a component of a composite code (the other parts of the composite code being based on other coding rules).
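One possible way to produce such codes is to draw distinct random values from the full value space of a fixed-width numeric field. This is only a sketch: the function name, the four-digit width and the optional `seed` parameter are assumptions, and a real system would also need to persist the assignments.

```python
import random


def unordered_codes(objects, width=4, seed=None):
    """Assign each object a distinct random numeric code of `width` digits."""
    objects = list(objects)
    space = 10 ** width
    if len(objects) > space:
        raise ValueError("more coding objects than available code values")
    rng = random.Random(seed)
    # sample() draws without replacement, so no two objects share a code.
    values = rng.sample(range(space), len(objects))
    return {obj: str(v).zfill(width) for obj, v in zip(objects, values)}
```

Sampling without replacement keeps the uniqueness principle of 7.1 intact even though the values carry no order or meaning.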
8.2.3 Abbreviation codes
8.2.3.1 Rules
The essential feature of this code is to abbreviate the name of the coded object according to a unified method, and assign one or more characters from the name of the coded object to the coded representation.
8.2.3.2 Application
Abbreviation codes can be used effectively for limited sets of identification codes that are quite stable and whose coding object names are well known in the user environment.
Example: In GB/T 2659-2000 "Codes for Names of Countries and Regions in the World", the letter codes of some countries are shown in Table 5.
Table 5
Country name     Two-letter code   Three-letter code
Austria          AT                AUT
Canada           CA                CAN
China            CN                CHN
France           FR                FRA
United States    US                USA
8.2.4 Hierarchical Code
8.2.4.1 Rules
The hierarchical code is based on the hierarchical classification of the set of coding objects; the coding objects are coded into continuous and increasing groups (classes). Each group (class) at a higher level contains, and can only contain, all the groups (classes) at the lower levels below it. This type of code must be based on the differences between the characteristics of the coding objects at each level. A hierarchical code subdivided to a lower level is in fact a composite code of the higher-level code segment and the lower-level code segment, as shown in Figure 2.
8.2.4.2 Application
Hierarchical codes are usually used for classification purposes and are rarely used for identification and reference purposes. They are very suitable for situations such as statistics, the reporting of cargo movement, and subject-based publication classification. In practice there are both fixed formats and variable formats; fixed formats are easier to handle than variable formats.
Example 1: Fixed incremental format. The subject code format in GB/T 13745-1992 "Subject Classification and Code" consists of 7 digits. Each lower subject level adds a fixed 2-digit code segment to the code of the level above it. Some of its codes are shown in Table 6.
Table 6
Code        Subject name
110·14      Mathematical Logic and Foundations of Mathematics
110·1410    Deductive Logic
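The fixed incremental format of Example 1 can be sketched as a split of the code into cumulative prefixes, one per level. The segment lengths 3 + 2 + 2 are an assumption inferred from the 7-digit length, the 2-digit increment and the codes in Table 6; the function names are hypothetical.

```python
# Digits contributed by each level (assumed: 3 + 2 + 2 = 7 digits).
SEGMENTS = (3, 2, 2)


def split_levels(code):
    """Return the code prefix of every level, from highest to lowest."""
    digits = code.replace("·", "")
    if not digits.isdigit():
        raise ValueError(f"not a numeric hierarchical code: {code!r}")
    prefixes = []
    end = 0
    for length in SEGMENTS:
        end += length
        if len(digits) < end:
            break
        prefixes.append(digits[:end])
    # The code must end exactly on a level boundary.
    if not prefixes or len(prefixes[-1]) != len(digits):
        raise ValueError(f"code length does not match the level structure: {code!r}")
    return prefixes


def is_subordinate(code, ancestor):
    """True if `ancestor` is a higher-level prefix of `code`."""
    return split_levels(ancestor)[-1] in split_levels(code)[:-1]
```

Because every level has a fixed width, the superordinate code is obtained by simple truncation, which is what makes fixed formats easier to process than variable ones.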
Example 2: Variable incremental format. In the Universal Decimal Classification (UDC), the number of characters and the segmentation of the coded expression are variable, and the degree of detail can be extended to the desired level. A concept such as "roof slope in architecture" can be expressed by the coded expression 624.024.13, which narrows level by level:
Civil engineering
    Building components
        Roof, roofing materials
            Roof slope (624.024.13)
8.2.5 Matrix code
8.2.5.1 Rules
A matrix code is based on a two-entry table (matrix). The values assigned to the rows and columns of this table form the coded expression of the coding object located at the corresponding coordinates in the table.
This method is advantageous when the coding objects in the matrix table share several common characteristics.
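As one illustration of the row-and-column rule, the sketch below uses the GB 2312 character table, in which a character is addressed by a row number and a column number of two digits each. The mapping to bytes (adding 0xA0 to each coordinate, as in the EUC-CN encoding implemented by Python's `gb2312` codec) is background knowledge rather than something stated in this standard, and the function names are hypothetical.

```python
def matrix_code(row, col):
    """Build the 4-digit matrix code from a row and a column number (1..94)."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in 1..94")
    return f"{row:02d}{col:02d}"


def char_at(code):
    """Look up the GB 2312 character at the coordinates given by `code`."""
    row, col = int(code[:2]), int(code[2:])
    return bytes([0xA0 + row, 0xA0 + col]).decode("gb2312")


def code_of(char):
    """Return the matrix code (row then column) of a GB 2312 character."""
    first, second = char.encode("gb2312")
    return matrix_code(first - 0xA0, second - 0xA0)
```

The code value is nothing more than the coordinates of the cell, so both lookups are plain arithmetic on the row and column numbers.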
8.2.5.2 Application
Matrix codes can be used effectively to identify coding objects that have good structure and stable characteristics; these coding objects share several common characteristics in different combinations.
Example: In GB 2312-1980, the basic graphic characters used for Chinese character information exchange are encoded according to the matrix code method, where the area is the row number in the matrix table and the position is the column number in the matrix table. The Chinese character "啊" is represented by area code 16 and position code 01; similarly, another graphic character is represented by the area-position code 0313.
8.2.6 Concatenation code
8.2.6.1 Rules
A concatenation code is a composite code composed of several code segments, which provide the characteristics of the coding object. These characteristics are largely independent of each other. The coded expressions of this method can be a combination of any types (sequential codes, abbreviation codes, and unordered codes).
8.2.6.2 Application
Concatenation codes are very suitable for commodity classifications with several common characteristics (examples: track, grade, shape). Code segments can be applied to make descriptive codes (what product, when and where it was produced) or to develop group technology methods for manufacturing.
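A descriptive code of the "what, when, where" kind might be assembled from independent segments as sketched below. The segment layout (a 3-letter product abbreviation, an 8-digit production date and a 2-digit plant number) and all names are purely hypothetical choices for this illustration.

```python
from datetime import date


def descriptive_code(product, produced_on, plant_no):
    """Concatenate three independent segments: product, date, plant."""
    if not (product.isalpha() and len(product) == 3):
        raise ValueError("product abbreviation must be 3 letters")
    if not 0 <= plant_no <= 99:
        raise ValueError("plant number must fit in 2 digits")
    return f"{product.upper()}{produced_on:%Y%m%d}{plant_no:02d}"


def parse_descriptive_code(code):
    """Split a 13-character descriptive code back into its segments."""
    if len(code) != 13:
        raise ValueError("expected 13 characters")
    produced_on = date(int(code[3:7]), int(code[7:9]), int(code[9:11]))
    return code[:3], produced_on, int(code[11:13])
```

Each segment can be changed without affecting the others, which reflects the independence of the characteristics in a concatenation code.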
8.2.7 Combination Codes
8.2.7.1 Rules
Combination codes are also composite codes composed of several code segments, which provide different characteristics of the coding object. Unlike concatenation codes, these characteristics are interdependent and usually linked by a hierarchy.
8.2.7.2 Application
Combination codes are often used for identification purposes, covering a wide range of application areas.
Example: GB 11643-1999 "Citizen Identity Number".
18-digit combination code structure of the citizen identity number:
XXXXXX XXXXXXXX XXX X
Digits 1 to 6: Administrative division code
Digits 7 to 14: Date of birth
Digits 15 to 17: Sequence number, where odd numbers represent males and even numbers represent females
Digit 18: Check code
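The four-segment structure above can be sketched as a parser that also verifies the check code. The check calculation shown here assumes the ISO 7064 MOD 11-2 scheme (GB/T 17710) with the usual weights, which this section does not spell out; the function names and the sample number are used for illustration only.

```python
from datetime import date

# Assumed MOD 11-2 weights for positions 1..17 and the resulting check characters.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def check_char(first17):
    """Compute the check character from the first 17 digits."""
    total = sum(int(d) * w for d, w in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]


def parse_identity_number(number):
    """Split an 18-character identity number into its four code segments."""
    if len(number) != 18 or not number[:17].isdigit():
        raise ValueError("expected 17 digits followed by a check character")
    if check_char(number[:17]) != number[17].upper():
        raise ValueError("check code does not match")
    sequence = number[14:17]
    return {
        "division": number[:6],
        "birth": date(int(number[6:10]), int(number[10:12]), int(number[12:14])),
        "sequence": sequence,
        "sex": "male" if int(sequence) % 2 else "female",
        "check": number[17].upper(),
    }
```

The dependency between segments is visible in the code: the check character can only be computed once the first three segments are fixed.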
The entire 18-digit combination code is divided into 4 segments. The first two code segments identify the spatial and temporal characteristics of the coding object (the citizen), the third code segment depends on the range limited by the first two code segments, and the fourth code segment depends on the result of the check calculation performed after the first three code segments have been assigned.
8.3 Code Characteristics
8.3.1 Overview
In addition to the uniqueness, rationality, extensibility, simplicity and applicability discussed in Chapter 7 "Basic Principles of Information Coding", the general characteristics of codes also include stability, meaningfulness, code length, structure and format, capacity and other characteristics.
8.3.2 Stability
A code is stable when it leaves room for design changes without its structure having to be modified. Users need stable codes. Code values must be assigned so that the likelihood of unforeseen modification of the code values and of the code structure is kept to a minimum. When a code element is withdrawn from the code element set, its coded expression should no longer be used for other coding objects.
8.3.3 Meaningfulness
A code is considered meaningful if its coded expressions express their meaning directly (e.g., abbreviation codes) or indirectly on the basis of one or more tables (e.g., hierarchical codes, matrix codes and concatenation codes). When coded expressions are used, meaningfulness is also related to the classification and grouping (classes) based on the characteristics of the coding object. For classification, meaningfulness is particularly important. For identification and reference purposes, it is better to use non-significant codes.
8.3.4 Code length
Code length refers to the number of positions in the coded expression. The code length can be specified as a fixed or a variable number of characters.
Note: Variable code length has two main disadvantages. First, when the data field storing the code value has more character positions than the code value uses, the unpredictable number of characters causes alignment problems. Second, errors caused by redundant or added characters cannot easily be detected by humans or machines. Therefore, the code length should use a fixed number of characters.
8.3.5 Structure and format
The definition of the code structure includes the number of positions or position groups that constitute the coded expression, and the set of valid characters at each position. Spaces can be used as components of the structure. Checking for syntax errors during input verification is mainly related to the structure. For each position group, each position of the coded expression can be defined in one of the following formats: alphabetic, numeric, alphanumeric, special characters, etc.
8.3.6 Capacity
Capacity is the number of coded expressions that can be constructed from all possible combinations of characters in each position in the chosen radix.
Example (C denotes capacity):
a) For 1 position and radix 2, using binary characters: C = 2^1 = 2
b) For 3 positions and radix 10, using decimal characters: C = 10^3 = 1000
c) For 2 positions and radix 26, using alphabetic characters: C = 26^2 = 676
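The capacity calculation can be sketched in a few lines, assuming that every position may take any character of its radix. The function names are illustrative; `capacity` also covers the mixed case in which positions use different character sets.

```python
from math import prod


def uniform_capacity(positions, radix):
    """Capacity when every position uses the same radix: radix ** positions."""
    return radix ** positions


def capacity(radixes):
    """Capacity when each position has its own radix: the product of all radixes."""
    return prod(radixes)
```

For example, a hypothetical structure of one letter followed by two digits gives `capacity([26, 10, 10])`, that is 2600 theoretical code values.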
The theoretical capacity assumes that all possible combinations of all characters are used. Initial limitations imposed for practical or theoretical reasons reduce this theoretical capacity. In fact, the choice of capacity is the result of a compromise between the following factors:
a) the foreseeable expansion of the system;
b) the limit on the number of characters that make up the coded expression;
c) the ease of writing and using the coded expression;
d) the expected service life of the system;
e) the cost of operation, etc.
8.4 Code Representation
8.4.1 Digital Format Code
The digital format code is a code that uses one or more Arabic numerals to represent the coded object, referred to as the digital code for short. The characteristics of the digital code are simple structure, easy to use, easy to sort and easy to promote domestically and abroad. However, the description of the characteristics of the coded object is not intuitive.
When assigning values of a digital format code, it is not appropriate to use values that are all 0 or all 9, such as "0000" and "9999". These values should be reserved for special cases.