General principles and methods of establishing terminology corpus

Basic Information

Standard ID: GB/T 19101-2003

Standard Name: General principles and methods of establishing terminology corpus

Chinese Name: 建立术语语料库的一般原则与方法

Standard category: National Standard (GB)

Status: In force

Date of Release: 2003-05-14

Date of Implementation: 2003-12-01

Standard classification number

Standard ICS number: General, Terminology, Standardization, Documentation >> 01.020 Terminology (Principles and Coordination)

Standard Classification Number: General >> Basic Standards >> A22 Terms and Symbols

associated standards

Publication information

Publishing house: China Standards Press

ISBN: 155066.1-19896

Publication date: 2003-12-01

Other information

Release date: 2003-05-14

Review date: 2004-10-14

Drafters: Chen Yuzhong, Song Min, He Yan, Ye Sheng, Sui Zhifang, Cheng Yonghong, Xiao Yujing

Drafting unit: China Standards Research Center

Focal point unit: National Technical Committee on Terminology Standardization

Proposing unit: National Technical Committee on Terminology Standardization

Publishing department: General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China

Competent authority: National Standardization Administration

Introduction to standards:

This standard specifies the general principles and methods for establishing a terminology corpus. This standard applies to the research, development, maintenance and related management of terminology corpora. Other work involving corpus construction can also refer to it.


Some standard content:

ICS 01.020
National Standard of the People's Republic of China
GB/T 19101-2003
General principles and methods of establishing terminology corpus
Promulgated on 2003-05-14
Implementation on 2003-12-01
Promulgated by the General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China

1 Scope
2 Normative references
3 Terms and definitions
4 Basic requirements
4.1 Requirements for corpus
4.2 Requirements for markup language
4.3 Requirements for terminology corpus system
5 Processing and organization of terminology corpus
5.1 Levels of terminology corpus processing
5.2 Flow of terminology corpus processing
5.3 Organization of terminology corpus
6 Establishment and functional design of terminology corpus system
6.1 Establishment of terminology corpus system
6.2 Functional design of terminology corpus system
6.3 Service mode of terminology corpus system
7 Management and maintenance of terminology corpus system
Appendix A (Informative Appendix) Related national standards for establishing terminology corpus

This standard is one of the series of national standards for terminology databases. The standards in the series that have been issued are:
GB/T 13726-1992  Magnetic tape format for recording and exchanging terminology and dictionary entries
GB/T 16785-1997  Terminology work  Concepts and terminology coordination
GB/T 16786-1997  Terminology work  Computer applications  Data categories
GB/T 17532-1998  Terminology work  Computer applications  Vocabulary
GB/T 18155-2000  Terminology work  Computer applications  Machine Readable Terminology Interchange Format (MARTIF)  Negotiated interchange
GB/T 13725-2001  General principles and methods for establishing terminology databases
GB/T 15387.1-2001  Guidelines for the preparation of terminology database development documents
GB/T 15387.2-2001  Guidelines for terminology database development
GB/T 15625-2001  Technical evaluation guidelines for terminology databases
GB/T 19102-2003  Information description specification for terminology component library
Appendix A of this standard is an informative appendix.
This standard was proposed by the National Technical Committee for Terminology Standardization.
This standard is under the jurisdiction of the China Standards Research Center.
This standard was drafted by the China Standards Research Center, the Institute of Computational Linguistics of Peking University and other units.
The main drafters of this standard are: Chen Yuzhong, Song Min, He Yan, Ye Sheng, Sui Zhifang, Cheng Yonghong, Xiao Yujing.

General principles and methods for establishing terminology corpora
1 Scope
This standard specifies the general principles and methods for establishing terminology corpora.
This standard applies to the research, development, maintenance and related management of terminology corpora. Other work involving corpus construction may also use it as a reference.
2 Normative references
The clauses in the following documents become clauses of this standard through reference in this standard. For dated references, subsequent amendments (excluding errata) and revisions do not apply to this standard; however, parties to agreements based on this standard are encouraged to study whether the latest versions of these documents can be used. For undated references, the latest version applies to this standard.
GB/T 13725-2001  General principles and methods for establishing terminology database
GB/T 15237.1-2000  Vocabulary for terminology work  Part 1: Theory and application (eqv ISO 1087-1:2000)
3 Terms and definitions
The terms and definitions established in GB/T 15237.1 apply to this standard. For ease of use, this standard repeats some of the terms and definitions.
3.1
Term
The verbal designation of a general concept in a specific professional field. [GB/T 15237.1-2000, 3.4.3]
3.2
Corpus
A collection of language data collected for analysis. [GB/T 15237.1-2000, 3.6.9]
3.3
Terminology corpus
A corpus used for analyzing and studying terminology.
3.4
Terminology corpus system
A terminology corpus together with its management framework.
Note: modified from GB/T17532-1998,7.7.4 Basic requirements
4.1 Requirements for corpus
4.1.1 Consistency
The corpus material to be stored should be consistent in format and valid.
4.1.2 Applicability
The corpus material should be selected from formal publications or from relevant professional literature published on authoritative websites.
4.1.3 Fidelity
The original information and chapter structure of the corpus material, such as the title, abstract, keywords and references, should be kept intact.
4.1.4 Extensiveness
When studying the terminology of a specific field, corpus material should be collected in relatively balanced quantities across the sub-fields. Where the sub-fields develop unevenly and new terms are therefore unevenly distributed, the quantities may be reallocated among some sub-fields, with the total quantity collected for the field kept unchanged, so as to increase the coverage of new terms. Sources should be diverse in subject matter and should be chosen with regard to professionalism, representativeness and objectivity. A suitable ratio between translated and original works should be determined. Regional distribution should also be considered when selecting material, that is, academic articles from Hong Kong, Macao, Taiwan and overseas Chinese communities should be collected as appropriate.
4.1.5 Timeliness
The corpus should be supplemented and updated in a timely manner.
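The balance rule in 4.1.4 can be illustrated with a small allocation routine. This is only a sketch: the standard prescribes no formula, so the function name `allocate_quota`, the sub-field names and the weights are all assumptions made for illustration. It splits a fixed total across sub-fields in proportion to weights, using largest-remainder rounding so that the total never changes when weights are shifted toward a fast-moving sub-field.

```python
def allocate_quota(total, weights):
    """Split `total` documents across sub-fields in proportion to `weights`.

    Hypothetical helper: largest-remainder rounding keeps the sum of the
    returned quotas exactly equal to `total`.
    """
    if total < 0 or not weights or any(w < 0 for w in weights.values()):
        raise ValueError("total and weights must be non-negative and non-empty")
    weight_sum = sum(weights.values())
    if weight_sum == 0:
        raise ValueError("at least one weight must be positive")

    exact = {name: total * w / weight_sum for name, w in weights.items()}
    quota = {name: int(value) for name, value in exact.items()}

    # Hand the documents lost to rounding to the largest fractional parts.
    leftover = total - sum(quota.values())
    by_fraction = sorted(exact, key=lambda n: exact[n] - quota[n], reverse=True)
    for name in by_fraction[:leftover]:
        quota[name] += 1
    return quota


# Equal weights: relatively balanced quantities in each sub-field.
balanced = allocate_quota(900, {"computer science": 1, "communication": 1, "systems science": 1})

# One sub-field produces more new terms, so it gets a larger share
# while the total for the field stays at 900.
shifted = allocate_quota(900, {"computer science": 2, "communication": 1, "systems science": 1})
```

With equal weights every sub-field receives the same quota; raising one weight moves documents toward that sub-field without changing the overall total.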
4.2 Requirements for markup language
4.2.1 Universality
A widely used markup language with a corresponding software toolkit should be adopted.
4.2.2 Simplicity
It should be fully functional, simple to use, easy to extend and easy to develop software for.
4.2.3 Interchangeability
It should not be restricted by the specific platform used and should allow cross-platform exchange and sharing of corpora.
4.2.4 Value preservation
It should remain usable over the long term; annotated files should be easy to convert into other file formats and should adapt to the various storage formats required in practical applications of the corpus.
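The convertibility requirement in 4.2.4 can be made concrete with a short conversion sketch. It assumes the annotation is written in XML (the standard itself names no specific markup language in this clause) and uses hypothetical element names (`text`, `title`, `keywords`); only Python's standard library is used.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical annotated file; element names are illustrative only.
SAMPLE = """<text id="doc-0001">
  <title>Example title</title>
  <keywords>corpus; terminology</keywords>
</text>"""


def xml_to_json(xml_string):
    """Convert a flat annotated record into a JSON string."""
    root = ET.fromstring(xml_string)
    record = {"id": root.get("id")}
    for child in root:
        # Each direct child becomes one key; surrounding whitespace is dropped.
        record[child.tag] = (child.text or "").strip()
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


converted = xml_to_json(SAMPLE)
```

Because the markup is platform-neutral and machine-readable, the same record can be re-exported to whatever storage format an application needs.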
4.3 Requirements for terminology corpus system
4.3.1 Design principles and quality requirements
For the design principles and quality requirements of the terminology corpus system, refer to 6.1 of GB/T 13725-2001.
4.3.2 Requirements for computer system
For the computer hardware and software requirements of the terminology corpus system, refer to 6.2 of GB/T 13725-2001.
5 Processing and organization of terminology corpus
5.1 Processing levels of terminology corpus
The processing of a terminology corpus can be divided into three levels:
a) Original terminology corpus: a terminology corpus without any annotation.
b) Chapter-level annotated terminology corpus: a terminology corpus annotated with chapter-level (text-level) information.
c) Term-level annotated terminology corpus: a terminology corpus in which domain terminology information is annotated on top of the chapter-level annotation.
Terminology corpora are generally large, so chapter-level and term-level annotation should combine manual and automatic processing; to facilitate data exchange, annotation tools should use a common markup language.
5.2 Processing flow of terminology corpus
The general processing flow of terminology corpus is shown in Figure 1.
[Figure 1 General processing flow of terminology corpus. Recoverable labels: corpus sources (literature, websites, etc.; other terminology corpora), corpus collection, standardization, information annotation, terminology corpus.]
5.2.1 Corpus collection
Corpus material can come from national standards, industry standards and other standard documents; from officially published dictionaries, encyclopedias and other reference books, journals, textbooks and newspapers; and from relevant documents published on authoritative websites. It can also be obtained by networking with other terminology corpora or by exchanging corpus data and recording media.
5.2.2 Standardization
Corpus material obtained from the various channels is given preliminary processing according to an established standard format or set of rules, for example checking for duplicate content and converting files to a uniform format.
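The two examples named in 5.2.2, duplicate checking and unified format conversion, can be sketched as follows. The target format (UTF-8 text, NFC-normalized, with `\n` line endings), the function names and the sample files are assumptions for illustration; the standard leaves the actual format and rules to the project.

```python
import hashlib
import unicodedata


def normalize_text(raw_bytes, encoding="utf-8"):
    """Decode a file and convert it to one assumed house format."""
    text = raw_bytes.decode(encoding)
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [line.rstrip() for line in text.split("\n")]
    return "\n".join(lines).strip() + "\n"


def deduplicate(documents):
    """Return (unique, duplicates) for a dict of name -> normalized text.

    `duplicates` maps each dropped name to the name of the copy that was kept.
    """
    first_seen = {}
    unique, duplicates = {}, {}
    for name, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicates[name] = first_seen[digest]
        else:
            first_seen[digest] = name
            unique[name] = text
    return unique, duplicates


# Hypothetical input: the same text arrives in two encodings and layouts.
raw = {
    "a.txt": "术语语料库\r\n".encode("gb18030"),
    "b.txt": "术语语料库  \n".encode("utf-8"),
    "c.txt": "Another text\n".encode("utf-8"),
}
encodings = {"a.txt": "gb18030", "b.txt": "utf-8", "c.txt": "utf-8"}
normalized = {name: normalize_text(data, encodings[name]) for name, data in raw.items()}
unique, duplicates = deduplicate(normalized)
```

Normalizing before hashing matters: the two copies differ byte for byte, yet are recognized as the same content once both are in the uniform format. Exact hashing only finds identical texts; near-duplicates would need a different technique.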
5.2.3 Information annotation
After normalization, the original corpus can be annotated with a markup language at the chapter level, the term level, and so on, in line with the short-term and long-term goals of the project. Depending on the processing level of the terminology corpus, the optional annotation information generally falls into the following three categories:
5.2.3.1 Chapter information
Mainly includes:
- file identification;
- chapter number;
- corpus source;
- subject field (such as information science);
- sub-field (such as computer science and technology, electronics, communication and automatic control technology, information science and systems science);
- subject-matter source (such as magazines, newspapers, books);
- work type (such as original work, translated work);
- regional distribution (such as mainland corpus, Hong Kong and Taiwan corpus);
- time;
- title;
- author;
- author unit;
- abstract;
- keywords;
- text;
- paragraph;
- sentence;
- references, etc.
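The chapter information listed above can be recorded as a header plus a paragraph-segmented body. The following is a minimal sketch assuming XML as the markup language; the element names (`text`, `header`, `body`, `p`) and all sample values are hypothetical and are not taken from the standard.

```python
import xml.etree.ElementTree as ET


def annotate_text(file_id, metadata, paragraphs):
    """Build a chapter-level annotated document as an XML element tree."""
    root = ET.Element("text", {"id": file_id})

    # One child element per metadata field (source, field, title, ...).
    header = ET.SubElement(root, "header")
    for field, value in metadata.items():
        ET.SubElement(header, field).text = value

    # Paragraphs are numbered so later term-level annotation can refer to them.
    body = ET.SubElement(root, "body")
    for number, paragraph in enumerate(paragraphs, start=1):
        ET.SubElement(body, "p", {"n": str(number)}).text = paragraph
    return root


# Placeholder values only; a real project would fill these from the source file.
doc = annotate_text(
    "doc-0001",
    {
        "source": "example journal",
        "field": "information science",
        "workType": "original work",
        "region": "mainland",
        "title": "Example title",
        "author": "Example author",
    },
    ["First paragraph.", "Second paragraph."],
)
xml_output = ET.tostring(doc, encoding="unicode")
```

Keeping the metadata in a separate header leaves the body text intact, which fits the fidelity requirement in 4.1.3.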
5.2.3.2 Term information
Mainly includes:
- term;
- term structure;
- part of speech, etc.
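Term-level annotation can be sketched as inline markup around each term, carrying its structure and part of speech as attributes. The example below assumes XML, a hypothetical `term` element, and made-up attribute values; matching is plain substring lookup against a supplied term list, which stands in for the automatic step of a human-computer workflow and would still need manual review.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def annotate_terms(sentence, term_info):
    """Wrap every known term in a <term> element inside an <s> element.

    `term_info` maps a term to its attributes, e.g. {"structure": ..., "pos": ...}.
    """
    s = ET.Element("s")
    if not term_info:
        s.text = sentence
        return s

    # Longest terms first, so "terminology corpus" wins over "corpus".
    ordered = sorted(term_info, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))

    position, last = 0, None
    for match in pattern.finditer(sentence):
        plain = sentence[position:match.start()]
        if last is None:
            s.text = plain      # text before the first term
        else:
            last.tail = plain   # text between two terms
        last = ET.SubElement(s, "term", term_info[match.group()])
        last.text = match.group()
        position = match.end()

    rest = sentence[position:]
    if last is None:
        s.text = rest
    else:
        last.tail = rest
    return s


sentence = "A terminology corpus is a corpus used for studying terminology."
terms = {
    "terminology corpus": {"structure": "noun+noun", "pos": "n"},
    "corpus": {"structure": "simple", "pos": "n"},
}
annotated = annotate_terms(sentence, terms)
frequencies = Counter(element.text for element in annotated.iter("term"))
```

Once terms are marked up this way, simple statistics such as term frequencies can be read directly off the annotation, and the original sentence is still recoverable by joining the text content.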
5.2.3.3 Others
A multifunctional terminology corpus should be flexible and allow new descriptive information to be added, so as to provide the various types of information required by different user groups.
5.2.4 Terminology corpus generation
Generate a terminology corpus according to the specified formats and requirements.
5.3 Terminology corpus organization
To facilitate terminology research, corpus exchange and terminology corpus system development, the corpus material in a terminology corpus should, as far as possible, be classified, stored and managed using a common classification method. Common classification methods include:
a) Chinese Classification for Standards (CCS);
b) International Classification for Standards (ICS);
c) GB/T 13745 Subject classification and code, etc.
6 Establishment and functional design of terminology corpus system
6.1 Establishment of terminology corpus system
The basic process of establishing a terminology corpus system should follow the principles and methods of general system establishment.
6.2 Functional design of terminology corpus system
According to the needs of terminology research, a terminology corpus system should generally provide functions such as querying term usage examples and compiling frequency statistics by field.
6.3 Service mode of terminology corpus system
The service mode should be convenient for users and can be selected as needed during system design, for example query, online retrieval, or access through the Internet.
7 Management and maintenance of terminology corpus system
It should at least include the following:
- corpus management and update;
- update of service modes or functions;
- maintenance and management of the terminology corpus system;
- maintenance and management of input and output devices, etc.

Appendix A
(Informative Appendix)
Related national standards for establishing terminology corpus
GB/T 3860  Rules for thesaurus indexing of documents
GB/T 10112  Principles and methods of terminology work
GB/T 13190  Rules for compiling Chinese thesaurus
GB/T 13745  Subject classification and code
GB/T 15237.1  Vocabulary for terminology work  Part 1: Theory and application
GB/T 17532  Terminology work  Computer applications  Vocabulary
GB/T 14814  Information processing  Text and office systems  Standard Generalized Markup Language (SGML)

National Standard of the People's Republic of China
General principles and methods for establishing terminology corpora
GB/T 19101-2003
Published by China Standards Press
No. 16, Sanlihebei Street, Fuxingmenwai, Beijing
Postal code: 100045
Tel: 68523946 68517548
Printed by China Standards Press Huangdao Printing Factory
Distributed by Xinhua Bookstore Beijing Distribution Office
Sold by Xinhua Bookstores in all regions
Format: 880×1230 1/16  Printing sheets: 3/4  Word count: 15,000
First edition: October 2003  First printing: October 2003
Print run: 1-1000
Book number: 155066.1-19896
Website: bzcbs.com
Copyright reserved
Infringements will be investigated
Report hotline: (010) 68533533
Note: Only an excerpt of the complete standard is shown above.