Directives for the work of standardization--The basic principles and methods for information classifying and coding
National Standard of the People's Republic of China
Guidelines for the work of standardization
Basic principles and methods for information classification and coding
Standards, principles, methods
UDC 025.4
GB 7027-86
Information classification and coding is an important part of information standardization work. This guideline introduces the basic principles and methods of information classification and coding. All systems and units should refer to it when formulating information classification and coding standards.
1 Information classification
Information classification means distinguishing and grouping information according to certain principles and methods, based on the attributes or characteristics of the information content, and establishing a classification system and arrangement order so that the information can be managed and used.
1.1 Basic principles of information classification
Information classification must follow the basic principles below.
1.1.1 Scientificity
Usually, the most stable essential attributes or characteristics of the things or concepts (i.e., the classification objects) should be selected as the basis for classification.
1.1.2 Systematicity
The selected attributes or characteristics of the things and concepts should be systematized in a certain order to form a reasonable, scientific classification system.
1.1.3 Extensibility
Usually, catch-all categories should be set up so that the established classification system is not disrupted when new things or concepts are added. At the same time, conditions should be created for subordinate information management systems to extend and refine the classification system.
1.1.4 Compatibility
The classification should be coordinated with relevant standards (including international standards).
1.1.5 Comprehensive Practicality
Classification should start from the perspective of systems engineering, and local problems should be handled within the overall system so that the system as a whole is optimized. That is, provided the overall tasks and requirements of the system are met, the actual needs of all relevant units in the system should be met as far as possible.
1.2 Basic methods of information classification
There are two basic methods of information classification: line classification and surface classification.
1.2.1 Line classification
Also known as hierarchical classification. The initial classification object (i.e. the things or concepts to be divided) is divided into successive levels of categories according to the selected attributes or characteristics (the basis for classification), and the categories are arranged into a hierarchical, step-by-step classification system. In this system, categories of the same level are in a parallel relationship, and a lower-level category is subordinate to its upper-level category. Categories of the same level do not repeat or overlap.
Upper-level category: in the line classification system, a category is called the upper-level category relative to the lower-level categories directly divided from it.
Lower-level category: in the line classification system, a category directly divided from an upper-level category is called a lower-level category relative to that upper-level category.
Same-level categories: in the line classification system, the categories directly divided from the same upper-level category are called same-level categories.
Issued by the National Bureau of Standards on November 25, 1986; implemented on October 1, 1987.
For example, GB 2260-86 "Administrative Division Code of the People's Republic of China" adopts the line classification method and uses a six-digit code. The national administrative divisions are divided into three levels, each represented by a two-digit code. The first level is the province (autonomous region, municipality directly under the central government), represented by the first and second digits; the second level is the region (city, state, league), represented by the third and fourth digits; the third level is the county (city, banner, town, district), represented by the fifth and sixth digits. The division and codes of some administrative divisions in Hebei Province are shown in Table 1.
Table 1
Code      Name
          Hebei Province
          Shijiazhuang City
          Tangshan City
          Xingtai Region
132221    Xingtai County
132222    Shahe County
As listed in Table 1, Hebei Province is the superordinate category relative to Shijiazhuang City, Tangshan City, and Xingtai Region; Shijiazhuang City, Tangshan City, and Xingtai Region are subordinate categories relative to Hebei Province. At the same time, Shijiazhuang City, Tangshan City, and Xingtai Region are categories of the same level, and there is a parallel relationship between them. Similarly, Xingtai Region is the superordinate category relative to Xingtai County and Shahe County; Xingtai County and Shahe County are subordinate categories of Xingtai Region.
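The relationship between the three levels and the six digits can be sketched in Python. This is only an illustration: the function names are hypothetical, and the sketch assumes that a category above county level fills its unused lower positions with "00", which the text here does not state.

```python
from typing import Optional, Tuple


def split_division_code(code: str) -> Tuple[str, str, str]:
    """Split a six-digit code into its province, region and county parts."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError("expected a six-digit numeric code")
    return code[0:2], code[2:4], code[4:6]


def parent_code(code: str) -> Optional[str]:
    """Return the code of the superordinate category, or None at the top level.

    Assumption (not stated in the text): unused lower levels are written "00".
    """
    province, region, county = split_division_code(code)
    if county != "00":
        return province + region + "00"
    if region != "00":
        return province + "0000"
    return None


print(split_division_code("132221"))  # ('13', '22', '21')
print(parent_code("132221"))          # 132200
print(parent_code("132200"))          # 130000
```

Two codes are of the same level under one superordinate category exactly when `parent_code` returns the same value for both, as it does for 132221 and 132222.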
1.2.1.1 Principles of Line Classification
a. The total scope of the lower-level categories divided from a superordinate category should equal the scope of that superordinate category;
b. When a superordinate category is divided into several lower-level categories, one basis of division should be selected;
c. Categories of the same level should not overlap or repeat, and only correspond to ... superordinate categories;
d. Classification should be carried out level by level in sequence, with no empty layers or added layers.
1.2.1.2 Advantages of Line Classification
a. Good hierarchy, which reflects the logical relationship between categories well;
b. Easy to use: it matches the traditional habits of manual information processing and is also convenient for computer processing.
1.2.1.3 Disadvantages of Line Classification
a. Poor structural flexibility: once the classification structure is determined, it is not easy to change;
b. Low efficiency: when there are many classification levels, the codes become long, which slows data processing.
1.2.2 Surface Classification Method
Surface classification method is to regard several attributes or characteristics of the selected classification object as several "surfaces", and each "surface" can be divided into many independent categories. When using, the categories in these "surfaces" can be combined together to form a composite category according to needs.
For example, clothing can be classified by the surface classification method, taking the material used, men's or women's style, and clothing style as three "surfaces". Each "surface" can be divided into several categories, see Table 2.
Table 2
"Surface"                   Categories
Material                    pure wool, medium and long fibers, ...
Men's and women's styles    men's, women's
Clothing style              Mao suit, suit, Liannong jacket, ...
When in use, the relevant categories are combined, for example: pure wool men's Mao suit, medium and long fiber women's suit, etc.
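Combining one category from each "surface" can be sketched as follows. The category lists contain only the values named in the example above, and the names `SURFACES`, `compose` and `all_composites` are illustrative, not part of the standard.

```python
from itertools import product

# Each "surface" holds independent categories. The order of the keys is the
# fixed position of each surface within a composite category.
SURFACES = {
    "material": ["pure wool", "medium and long fiber"],
    "men's or women's style": ["men's", "women's"],
    "clothing style": ["Mao suit", "suit"],
}


def compose(*chosen: str) -> str:
    """Build a composite category from one category per surface, in order."""
    if len(chosen) != len(SURFACES):
        raise ValueError("choose exactly one category per surface")
    for (surface, categories), category in zip(SURFACES.items(), chosen):
        if category not in categories:
            raise ValueError(f"{category!r} is not a category of {surface!r}")
    return " ".join(chosen)


def all_composites() -> list:
    """Enumerate every composite category the surfaces can form."""
    return [" ".join(combo) for combo in product(*SURFACES.values())]


print(compose("pure wool", "men's", "Mao suit"))  # pure wool men's Mao suit
print(len(all_composites()))                      # 8
```

Even this small example yields 2 x 2 x 2 = 8 composite categories, of which only some may be needed in practice.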
1.2.2.1 Principles of surface classification
a. Select the essential attributes or characteristics of the classification object as its "surfaces" according to needs;
b. Categories in different "surfaces" should not overlap with each other or appear repeatedly;
c. Each "surface" has a strictly fixed position;
d. The selection of "surfaces" and the determination of their positions shall be based on actual needs.
1.2.2.2 Advantages of surface classification method
a. Great flexibility: a change to the categories in one "surface" does not affect the other "surfaces";
b. Strong adaptability: categories can be combined in any way according to needs, which is also convenient for machine processing;
c. It is easy to add and modify categories.
1.2.2.3 Disadvantages of surface classification method
a. Capacity cannot be fully used: many categories can be formed by combination, but sometimes few of them are of practical use;
b. It is difficult to process information manually.
Line classification and surface classification each have their own advantages and disadvantages. Because objective things are complex, a single classification method sometimes cannot meet the requirements of users. In practical applications the two can therefore be combined according to the situation, with one method as the main one and the other as a supplement. Sometimes special rules also have to be laid down to meet the requirements of users.
2 Information Coding
Information coding means assigning to things or concepts (coding objects) symbols that follow certain rules and are easy for computers and people to recognize and process.
2.1 Code
A code is a symbol or a group of symbols that is ordered and easy for computers and people to recognize and process; it is sometimes referred to simply as a "code". The functions of codes are as follows:
a. Identification: the code is the unique mark that identifies the coding object;
b. Classification: when coding objects are classified according to their attributes or characteristics (such as process, material, purpose, etc.) and assigned different category codes, the code can also serve as a mark of the category of the coding object;
c. Sorting: when coding objects are ordered by time of discovery (production), space occupied or some other sequential relationship and assigned different codes accordingly, the code can also serve as a mark of the order of the coding object;
d. Specific meaning: when special symbols are used to meet some objective need, the code can also carry a specific meaning;
e. Others: functions other than the above.
Among these functions, identification is the most basic characteristic of a code, and every code must have it. The other functions are chosen and assigned by people to make information easier to process and manage.
2.2 Basic principles of coding
2.2.1 Uniqueness
Although a coding object may have many different names and may be described in various ways, in a classification and coding standard each coding object has only one code, and one code represents only one coding object.
2.2.2 Rationality
The code structure should be compatible with the classification system.
2.2.3 Scalability
Appropriate spare capacity must be reserved to meet the needs of continuous expansion.
2.2.4 Simplicity
The code structure should be as simple as possible and the code as short as possible, in order to save machine storage space, reduce the error rate of the code, and improve the efficiency of machine processing.
2.2.5 Applicability
The code should reflect the characteristics of the coding object as far as possible, so that it aids memory and is easy to fill in.
2.2.6 Normativity
In an information classification and coding standard, the type of code, the structure of the code, and the format in which the code is written must be unified.
2.3 Types of codes
There are many types of codes. The following are the main commonly used code types, with their advantages and disadvantages, for selection when coding. The types and names of codes are as follows:
Code
    Meaningless code
        Sequential code
        Non-sequential code
    Meaningful code
        Series sequential code
        Numerical alphabetical sequential code
        Hierarchical code
        Feature combination code
        Composite code
Figure 1 shows the basic code types commonly used in information coding. Codes can be divided into meaningful codes and meaningless codes according to their function. Common meaningless codes include the sequential code and the non-sequential code; common meaningful codes include the series sequential code, numerical alphabetical sequential code, hierarchical code, feature combination code and composite code.
2.3.1 Meaningless code
A meaningless code is a code with no actual meaning. It serves only as the unique identifier of the coding object, standing in for its name, and provides no other information about the coding object. The sequential code and the non-sequential code are two commonly used meaningless codes.
2.3.1.1 Sequential code
The sequential code is the simplest and most commonly used code. It assigns consecutive natural numbers or letters to the coding objects. For example, in GB 2261-81 "Human Gender Code", 1 is male and 2 is female. Coding objects that do not form a system often use this code.
Advantages of sequential code: the code is short, easy to use, easy to manage and easy to extend, and it places no special requirement on the order of the coding objects.
Disadvantages of sequential code: the code itself gives no information about the coding object.
2.3.1.2 Non-sequential code
A non-sequential code assigns numbers or letters in no particular order to the coding objects. This kind of code has no coding rule and is usually generated by a random program on a machine.
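The two kinds of meaningless code can be contrasted with a short Python sketch. The function names, the four-digit width and the use of Python's `random` module are assumptions made for illustration; the standard does not prescribe any particular generator.

```python
import random


def sequential_codes(objects, start=1):
    """Assign consecutive natural numbers in the order the objects arrive."""
    return {obj: number for number, obj in enumerate(objects, start=start)}


def non_sequential_codes(objects, digits=4, seed=None):
    """Assign distinct random numbers, zero-padded to a fixed width."""
    objects = list(objects)
    rng = random.Random(seed)
    # sample() draws without replacement, so every code is unique
    numbers = rng.sample(range(10 ** digits), k=len(objects))
    return {obj: f"{n:0{digits}d}" for obj, n in zip(objects, numbers)}


print(sequential_codes(["male", "female"]))  # {'male': 1, 'female': 2}
print(non_sequential_codes(["a", "b", "c"], seed=0))
```

In both cases the code identifies the object uniquely but says nothing else about it.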
2.3.2 Meaningful code
A meaningful code is a code with some actual meaning. It not only serves as the unique identifier of the coding object, standing in for its name, but also provides related information about the coding object (such as classification, sorting, logical meaning, etc.). Commonly used meaningful codes include the series sequential code, numerical alphabetical sequential code, hierarchical code, feature combination code, and composite code.
2.3.2.1 Series sequential code
The series sequential code is a special sequential code. It divides the sequential code into several segments (series), makes them correspond to the segments of the classification objects, and assigns a range of sequential codes to each segment. This code is often used when the classification is shallow. For example, GB 4657-84 "Codes for the Names of Ministries, Commissions, Bureaus and Other Institutions of the State Council" adopts a three-digit series sequential code:
300-399 represents ministries and commissions of the State Council;
400-499 represents bureaus, offices and national bureau-level institutions directly under the ministries and commissions of the State Council, as well as advisory institutions and national academic institutions of the State Council;
700-799 represents national people's organizations.
Advantages of series sequential code: it can represent certain attributes or characteristics of the coding object and is easy to extend.
Disadvantages of series sequential code: when there are many unused codes it is inconvenient for machine processing, and it is not suitable for complex classification systems.
2.3.2.2 Numerical alphabetical sequential code
The numerical alphabetical sequential code is compiled according to the alphabetical order of the names of the coding objects: all coding objects are arranged in alphabetical order of their names and then assigned increasing numeric codes. For example, a numerical alphabetical sequential code arranged in English alphabetical order (see Table 3).
Table 3
Apples
Bananas
Cherries
Another example: a numerical alphabetical sequential code arranged in the order of the Chinese phonetic alphabet (see Table 4).
Table 4
(date)
Advantages of numerical alphabetical sequential code: coding objects are easy to classify (an object cannot fall into more than one class), the code is easy to maintain, it can serve as a code index (being compiled in alphabetical order), and it is easy to search.
Disadvantages of numerical alphabetical sequential code: when the standard is compiled, enough gaps must be left at one time for new coding objects. Sometimes, to keep newly added objects in their proper order, the gaps originally left are insufficient and the objects have to be re-coded. Relatively speaking, therefore, this kind of code has a short service life, and the codes are unevenly distributed across classes. Because this code is based on alphabetical order, it groups together coding objects whose names are linguistically similar; it works better if the coding objects are further subdivided by other characteristics. This code structure is generally suitable for retrieving information by name, such as the names of people, enterprises and institutions.
2.3.2.3 Hierarchical code
The hierarchical code is often used in line classification systems. It is a code whose order follows the subordination and hierarchy of the classification objects. For products, this order can be based on properties such as process, material, and purpose. When coding, the code is divided into several levels that correspond to the classification levels of the classification object. Read from left to right, the levels run from high to low: the left end of the code is the highest-level code and the right end is the lowest-level code. The code of each level can be a sequential code or a series sequential code. For example, GB 4754-84 "National Economic Industry Classification and Code" uses a three-layer, four-digit hierarchical code. The first, second, and third layers represent major categories, medium categories, and minor categories respectively. The code structure is as follows:
First layer code (major category)
Second layer code (medium category)
Third layer code (minor category)
Advantages of hierarchical code: it clearly indicates the category of the classification object, has strict subordination, a simple code structure and large capacity, and is easy for machines to summarize.
Disadvantages of hierarchical code: the code structure is inflexible, and when there are many levels the code becomes long.
The decimal code used in book classification follows basically the same principle as the hierarchical code. The difference is that the decimal code structure uses a decimal point, after which digits can be added as needed.
2.3.2.4 Feature combination code
The feature combination code is commonly used in surface classification systems. The classification object is divided into several "surfaces" according to its attributes or characteristics, and the categories in each "surface" are coded according to their own rules. There is therefore no hierarchical relationship or subordination between the codes of different "surfaces". When in use, a code is selected from each "surface" as needed, and the codes are combined in the predetermined order of the "surfaces" to indicate the category.
For example, machine screws can be given four "surfaces": material, screw diameter, screw head shape and surface treatment. Each "surface" is divided into several categories that are coded separately, as shown in Table 5.
Table 5
First surface (material): 1 - Stainless steel; 2 - Brass; ...
Second surface (screw diameter): ...; 3 - 1.5
Third surface (screw head shape): ...; 3 - Hexagonal head; 4 - Square head
Fourth surface (surface treatment): 1 - Untreated; 2 - Chrome plated; 3 - Galvanized
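A combined code is read one position per "surface". The Python sketch below decodes the example code 2342 used in this section. It contains only the Table 5 entries that the text supports; the numbering of stainless steel as 1 is an assumption based on its position before brass, and the names `FACES` and `decode` are illustrative.

```python
# One (surface name, {code digit: category}) pair per position, in fixed order.
FACES = [
    ("material", {"1": "stainless steel", "2": "brass"}),  # "1" is assumed
    ("screw diameter", {"3": "1.5"}),
    ("screw head shape", {"3": "hexagonal head", "4": "square head"}),
    ("surface treatment", {"1": "untreated", "2": "chrome plated", "3": "galvanized"}),
]


def decode(code: str) -> dict:
    """Look up each digit of a combined code in the surface at that position."""
    if len(code) != len(FACES):
        raise ValueError("the code needs one digit per surface")
    result = {}
    for digit, (surface, categories) in zip(code, FACES):
        if digit not in categories:
            raise KeyError(f"no category {digit!r} recorded for {surface!r}")
        result[surface] = categories[digit]
    return result


print(decode("2342"))
```

Because each position is looked up independently, changing the categories of one surface leaves the other three untouched.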
When in use, the codes of the "surfaces" are combined. For example, code 2342 means a brass screw of diameter 1.5 with a square head and chrome plating.
Advantages of feature combination code: the code structure has a certain flexibility and is suitable for machine processing.
Disadvantages of feature combination code: code capacity is poorly used, and the code is not convenient for summation and aggregation.
2.3.2.5 Composite code
The composite code is a widely used meaningful code. It is usually composed of two or more complete, independent codes. For example, a composite code made up of a classification part and an identification part divides the code of the coding object into those two parts. The classification part represents the hierarchy and subordination of the attributes or characteristics of the coding object. The identification part acts as the registration number of the coding object and is often a sequential code or a series sequential code. For example, the material catalog used by the United States and NATO countries uses a 13-digit numeric composite code, whose structure is as follows:
Classification part (4 digits): federal material classification number
Identification part (9 digits): National Coding Bureau code (2 digits) + item identification number (7 digits)
Here the identification part consists of a two-digit code for the United States or NATO National Coding Bureau and a seven-digit item identification number. Because an item identification number assigned by a NATO National Coding Bureau may duplicate one assigned by the United States Material Coding Bureau, the identification code of the US material catalog must combine the National Coding Bureau code and the item identification number into a nine-digit code. Only in this way is its integrity preserved, so that each item has exactly one code and the code serves as a unique identifier. The classification part consists of four digits indicating the category of the federal item classification. For ease of management it is a hierarchical code with two levels, major category and sub-category, each represented by two digits.
Advantages of composite code: the code structure is very flexible, and it is easy to expand the code capacity and to change the category of a coding object. The identification part can also be used in different information systems, which makes information exchange between systems easier.
Disadvantages of composite code: the total length of the code is relatively long.
2.4 Code verification
2.4.1 Purpose of code verification
In data processing, the code, as the unique representation of a thing or concept, is one of the important inputs to the computer. The correctness of code input therefore directly affects the quality of all the computer's data processing. For longer codes and for critical codes, a check code should be added to detect errors introduced by input, transcription and similar operations.
2.4.2 Calculation method of check code
To ensure that a code is entered correctly, a check code is appended to the original code. The check code is derived from the original code by a predetermined mathematical algorithm. When a code with its check code is entered, the computer applies the same algorithm to the entered code and compares the result with the entered check code. If they agree, the code was entered correctly; if they disagree, the code was entered incorrectly, and the error is reported automatically to the operator.
2.4.2.1 Error types that can be detected by the check code
The error types that can be detected by the check code are as follows:
a. Single substitution error: one character is replaced by another. For example, 1234 is mistakenly entered as 4234;
b. Transposition error: two adjacent characters, or two non-adjacent characters, are exchanged. For example, 12345 is mistakenly entered as 12354 or 12543;
c. Shift error: the entire code is shifted to the left or right;
d. Compound error: a combination of the above errors.
2.4.2.2 Formation of check code
There are many algorithms for forming check codes, and they need to be unified and standardized. A commonly used algorithm is introduced here by example. Suppose the code is 31504 and the weights of its digits, from left to right, are 6-5-4-3-2. The steps for forming the "modulo 11" check code are as follows:
a. Multiply each digit of the code by its weight:
(3×6) (1×5) (5×4) (0×3) (4×2)
b. Add up the products:
18+5+20+0+8=51
c. Divide the sum by the modulus 11 and take the remainder:
51÷11=4 ... 7
d. Subtract the remainder from the modulus 11 (11-7=4) and use the result as the check code. Append it to the end of the code to form the complete code with check code: 315044
When in use, the code is entered together with its check code: 315044. To verify it, repeat steps a-c on the entered code (the weight of the check code is 1; the other weights are as before). If the remainder is 0, the code was entered correctly; if the remainder is not 0, the code was entered incorrectly. For 315044 the sum is 18+5+20+0+8+4=55, and 55÷11=5 with remainder 0. Usually a 1-digit check code is used; when a 1-digit check code cannot meet the requirements, a 2-digit check code can be used.
When the code is composed of letters, or of letters and numbers, the letters A~Z are assigned numeric values in alphabetical order (0~25 or 10~35) for convenience of calculation.
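The "modulo 11" steps can be sketched in Python for numeric codes. For the code 31504 with weights 6-5-4-3-2 the weighted sum is 51, the remainder is 7 and the check code is 4. The function names are illustrative. The text does not say what to do when the subtraction gives 11 or 10, so the sketch assumes that 11 is written as 0 (which keeps the remainder-0 verification rule valid) and rejects 10.

```python
WEIGHTS = (6, 5, 4, 3, 2)  # weights of the example, left to right


def weighted_sum(digits: str, weights) -> int:
    """Steps a and b: multiply each digit by its weight and add the products."""
    return sum(int(d) * w for d, w in zip(digits, weights))


def check_digit(code: str, weights=WEIGHTS, modulus: int = 11) -> str:
    """Steps c and d: take the remainder and subtract it from the modulus."""
    if len(code) != len(weights) or not code.isdigit():
        raise ValueError("code must be numeric and match the weights in length")
    remainder = weighted_sum(code, weights) % modulus
    check = (modulus - remainder) % modulus  # assumption: 11 is written as 0
    if check > 9:
        raise ValueError("check value 10 has no single-digit form here")
    return str(check)


def is_valid(full_code: str, weights=WEIGHTS, modulus: int = 11) -> bool:
    """Verify a code with its check digit; the check digit has weight 1."""
    if len(full_code) != len(weights) + 1 or not full_code.isdigit():
        return False
    return weighted_sum(full_code, tuple(weights) + (1,)) % modulus == 0


print("31504" + check_digit("31504"))  # 315044
print(is_valid("315044"))              # True
print(is_valid("315404"))              # False: two adjacent digits exchanged
print(is_valid("315045"))              # False: one digit substituted
```

The last two calls show a transposition error and a single substitution error being caught, because both change the weighted sum to a value that 11 does not divide.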
2.4.2.3 Choice of modulus and weight
The modulus can be chosen in various ways. Commonly used moduli are 9, 10, 11, 37, 97, etc., of which 11 is used most often. Generally speaking, the larger the modulus, the higher the error detection rate. The choice of modulus should usually follow these principles:
a. The modulus should be greater than or equal to the number of characters in the code character set (10 for numeric codes, 26 for alphabetic codes, and 36 for alphanumeric codes);
b. The modulus is preferably a prime number (10 is a commonly used non-prime modulus).
The weights can also be chosen in various ways: from a geometric series, an arithmetic series, or a fixed series. The choice of weights should usually meet the following requirements:
a. The weights are natural numbers;
b. The modulus and the weight of each digit of the code are coprime;
c. The weights form a fixed order, or a sequence obtained by a fixed algorithm.
2.5 Code types
Code types generally include the following: digital codes, letter codes, and mixed codes of numbers and letters.
2.5.1 Digital code
A digital code uses one or more Arabic numerals to represent the coding object. It has a simple structure, is easy to use and sort, and is easy to promote at home and abroad, but it does not describe the characteristics of the coding object intuitively. The digital code is a code form widely used in many countries.
2.5.2 Letter code
A letter code uses one or more letters to represent the coding object. A letter code has much more capacity than a digital code of the same length: a one-letter code in English letters can represent 26 categories, whereas a one-digit code can represent at most 10 (0-9); a two-letter code can represent up to 676 (26^2) categories, whereas a two-digit code can represent at most 100 (10^2). A letter code can also sometimes give information that people recognize easily, as in the abbreviation codes for railway station names formulated by the Ministry of Railways: HB represents Harbin, BJ represents Beijing.
Letter codes are easy to remember and people are used to them, but they are less convenient for machine processing. In particular, when there are many coding objects, when additions and changes are frequent, or when the names of the coding objects are long, duplicates and conflicts often arise. This type of code is therefore mostly used where there are few coding objects.
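The capacity figures quoted in this subsection (26 versus 10 categories for one position, 676 versus 100 for two) follow from raising the size of the character set to the code length. A minimal Python sketch, with `capacity` as an illustrative name:

```python
def capacity(charset_size: int, length: int) -> int:
    """Number of distinct codes of a given length over a character set."""
    return charset_size ** length


for length in (1, 2, 3):
    print(
        length,
        capacity(10, length),  # digital code: digits 0-9
        capacity(26, length),  # letter code: A-Z
        capacity(36, length),  # mixed code: digits and letters
    )
```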
2.5.3 Mixed digital and alphabetic codes
Mixed digital and alphabetic codes are composed of numbers and letters, or of numbers, letters, and special characters. They can be referred to as alphanumeric codes. A mixed code largely combines the advantages of the digital code and the letter code: it has a strict structure, is intuitive, and is easy to use. Its more complex composition also brings disadvantages, such as inconvenient computer input, low input efficiency, a higher error rate, and inconvenient machine processing.
For all of the above code types, when a code is long, separators such as "," or spaces can be inserted into it as needed to make it easier to read.
Digital codes, letter codes, and mixed digital and alphabetic codes each have their own strengths. The choice usually depends on the user's needs, the amount of information, the frequency of information exchange, the capacity of the computer, the user's habits and other factors. In terms of processing efficiency and information exchange, however, digital codes are better.
Additional notes:
This standard was proposed by the Information Classification and Coding Research Institute of the National Bureau of Standards. This standard was drafted by the Information Classification and Coding Research Institute of the National Bureau of Standards. The drafter of this standard is Hu Jiazhang.