GB/T 19101-2003 General principles and methods for establishing a terminology corpus

This standard specifies the general principles and methods for establishing a terminology corpus. It applies to the research, development, maintenance and related management of terminology corpora. Other work involving corpus construction may also refer to it.
Excerpt from the standard:
ICS 01.020

National Standard of the People's Republic of China
GB/T 19101-2003
General principles and methods for establishing a terminology corpus

Issued on 2003-05-14. Implemented on 2003-12-01.
Issued by the General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China

Contents
1 Scope
2 Normative references
3 Terms and definitions
4 Basic requirements
4.1 Requirements for corpus
4.2 Requirements for markup language
4.3 Requirements for terminology corpus system
5 Processing and organization of terminology corpus
5.1 Processing level of terminology corpus
5.2 Processing flow of terminology corpus
5.3 Organization of terminology corpus
6 Establishment and functional design of terminology corpus system
6.1 Establishment of terminology corpus system
6.2 Functional design of terminology corpus system
6.3 Service mode of terminology corpus system
7 Management and maintenance of terminology corpus system
Appendix A (Informative Appendix) Related national standards for establishing terminology corpus

This standard is one of the series of national standards for terminology databases.
The standards in the series that have been issued are:
GB/T 13726-1992 Magnetic tape format for recording and exchanging terminology and dictionary entries
GB/T 16785-1997 Terminology work - Coordination of concepts and terms
GB/T 16786-1997 Terminology work - Computer applications - Data categories
GB/T 17532-1998 Terminology work - Computer applications - Vocabulary
GB/T 18155-2000 Terminology work - Computer applications - Machine Readable Terminology Interchange Format (MARTIF) - Negotiated interchange
GB/T 13725-2001 General principles and methods for establishing terminology databases
GB/T 15387.1-2001 Guidelines for the preparation of terminology database development documents
GB/T 15387.2-2001 Guidelines for terminology database development
GB/T 15625-2001 Technical evaluation guidelines for terminology databases
GB/T 19102-2003 Information description specification for terminology component library

Appendix A of this standard is an informative appendix.
This standard was proposed by the National Technical Committee for Terminology Standardization.
This standard is under the jurisdiction of the China Standards Research Center.
This standard was drafted by the China Standards Research Center, the Institute of Computational Linguistics of Peking University, and other units.
The main drafters of this standard are: Chen Yuzhong, Song Min, He Yan, Ye Sheng, Sui Zhifang, Cheng Yonghong, Xiao Yujing.

General principles and methods for establishing terminology corpora

1 Scope
This standard specifies the general principles and methods for establishing terminology corpora.
This standard applies to the research, development, maintenance and related management of terminology corpora. Other work involving corpus construction may also use it as a reference.

2 Normative references
The clauses in the following documents become clauses of this standard through reference in this standard.
For any dated referenced document, subsequent amendments (excluding errata) and revisions do not apply to this standard. However, parties to an agreement based on this standard are encouraged to study whether the latest versions of these documents can be used. For any undated referenced document, the latest version applies to this standard.
GB/T 13725-2001 General principles and methods for establishing terminology database
GB/T 15237.1-2000 Terminology work - Vocabulary - Part 1: Theory and application (eqv ISO 1087-1:2000)

3 Terms and definitions
The terms and definitions established in GB/T 15237.1 apply to this standard. For ease of use, this standard repeats some of them.
3.1 Term
Verbal designation of a general concept in a specific subject field.
[GB/T 15237.1-2000, 3.4.3]
3.2 Corpus
A collection of language data collected for analysis.
[GB/T 15237.1-2000, 3.6.9]
3.3 Terminology corpus
A corpus used for analyzing and studying terminology.
3.4 Terminology corpus system
A terminology corpus that includes a management framework.
Note: modified from GB/T 17532-1998, 7.7.4

4 Basic requirements
4.1 Requirements for corpus
4.1.1 Consistency
The corpus material to be stored should be consistent and valid in format.
4.1.2 Applicability
The corpus material should be selected from formal publications or from relevant professional literature published on authoritative websites.
4.1.3 Fidelity
The original information and chapter structure of the corpus material, such as the title, abstract, keywords and references, should be kept intact.
4.1.4 Extensiveness
When studying the terminology of a specific field, it is advisable to collect corpus material so that the quantity is relatively balanced across the sub-fields.
Where new terms are unevenly distributed because the sub-fields develop at different rates, it is advisable to allow the quantity of material to be reallocated among some sub-fields, while the total quantity collected for the field stays unchanged, so that the collected material covers more new terms.
The sources of corpus material should be diverse in subject matter, and the selection should weigh the application's requirements for professionalism, representativeness and objectivity.
A suitable ratio between translated works and original works should be determined.
Regional distribution should also be considered when selecting material; that is, academic articles from Hong Kong, Macao, Taiwan and overseas Chinese communities should be collected as appropriate.
4.1.5 Timeliness
The corpus should be supplemented and updated in a timely manner.

4.2 Requirements for markup language
4.2.1 Universality
A widely used markup language with a corresponding software toolkit should be adopted.
4.2.2 Simplicity
The markup language should be fully functional, simple to use, easy to extend, and convenient for software development.
4.2.3 Interchangeability
The markup language should not be tied to a specific platform and should allow corpora to be exchanged and shared across platforms.
4.2.4 Value preservation
The markup language should remain usable over the long term. Annotated files should be easy to convert into other file formats and should be adaptable to the various storage formats required in practical applications of the corpus.

4.3 Requirements for terminology corpus system
4.3.1 Design principles and quality requirements
For the design principles and quality requirements of the terminology corpus system, refer to 6.1 of GB/T 13725-2001.
4.3.2 Requirements for computer system
For the computer hardware and software requirements of the terminology corpus system, refer to 6.2 of GB/T 13725-2001.
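
The collection principle in 4.1.4 (roughly balanced quantities per sub-field, with some reallocation allowed while the field total stays fixed) can be illustrated with a small quota calculation. The sketch below is only one possible reading, not a method prescribed by the standard: the function name `allocate_quotas`, the `balance` parameter and the per-sub-field activity scores are all assumptions introduced for illustration.

```python
def allocate_quotas(total, weights, balance=0.7):
    """Split `total` documents among sub-fields.

    `weights` maps each sub-field to a hypothetical score for how many
    new terms it is expected to yield. balance=1.0 gives equal quotas;
    balance=0.0 gives quotas proportional to the scores. The quotas
    always sum to `total`, so the field total is unchanged.
    """
    if not weights:
        raise ValueError("at least one sub-field is required")
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must be between 0 and 1")
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")

    n = len(weights)
    # Blend an equal share with a share proportional to the score.
    raw = {
        name: total * (balance / n + (1.0 - balance) * w / weight_sum)
        for name, w in weights.items()
    }
    # Largest-remainder rounding: floor everything, then hand the
    # leftover documents to the sub-fields with the largest fractions.
    quotas = {name: int(value) for name, value in raw.items()}
    leftover = total - sum(quotas.values())
    by_fraction = sorted(raw, key=lambda name: raw[name] - quotas[name], reverse=True)
    for name in by_fraction[:leftover]:
        quotas[name] += 1
    return quotas


if __name__ == "__main__":
    scores = {"computer science": 3, "communications": 1, "systems science": 1}
    print(allocate_quotas(1000, scores))
```

Setting `balance` closer to 1 keeps the sub-fields nearly equal; lowering it shifts more of the fixed total toward sub-fields where new terms are expected to be denser.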
5 Processing and organization of terminology corpus
5.1 Processing level of terminology corpus
The processing of a terminology corpus can be divided into three levels:
a) Original terminology corpus: a terminology corpus without any annotation.
b) Chapter-level annotated terminology corpus: a terminology corpus annotated with chapter-level information about the text.
c) Term-level annotated terminology corpus: a terminology corpus in which domain terminology information is annotated on top of the chapter-level annotation.
Terminology corpora are generally large, so it is advisable to combine human and computer work for the chapter-level and term-level annotation. To facilitate data exchange, the annotation tool should use a common markup language.

5.2 Processing flow of terminology corpus
The general processing flow of a terminology corpus is shown in Figure 1:
[Figure 1 General processing flow of terminology corpus. Figure not reproduced; recoverable labels: corpus sources (literature, websites, etc.), other terminology corpora, corpus collection, standardization, information annotation, terminology corpus.]
5.2.1 Corpus collection
Corpus material can come from national standards, industry standards and other standard documents; from officially published dictionaries, encyclopedias, journals, textbooks, newspapers and other reference works; and from relevant documents published on authoritative websites. It can also be obtained by networking with other terminology corpora, by exchanging corpus data and recording media, and so on.
5.2.2 Standardization
The corpus material obtained from the various channels undergoes preliminary processing according to an established standard format or set of rules, for example checking for duplicate content and converting file formats to a unified format.
5.2.3 Information annotation
For the original corpus after normalization, a markup language can be used to annotate information at the chapter level, term level, etc.
The annotation should be chosen in combination with the short-term and long-term goals of the project research. Generally, according to the different levels of terminology corpus processing, the optional annotation information falls into the following three categories:
5.2.3.1 Chapter information
Mainly includes:
- file identification;
- chapter number;
- corpus source;
- discipline field (such as the information science field);
- sub-field (such as computer science and technology, electronic communication and automatic control technology, information science and system science, etc.);
- subject source (such as magazines, newspapers, books, etc.);
- work type (such as original work, translated work);
- regional distribution (such as mainland corpus, Hong Kong and Taiwan corpus);
- time;
- title;
- author, author unit;
- abstract;
- keywords;
- text;
- paragraph;
- sentences, references, etc.
5.2.3.2 Term information
Mainly includes:
- term;
- term structure;
- part of speech, etc.
5.2.3.3 Others
A multifunctional terminology corpus should be flexible and should allow new descriptive information to be added, so that it can supply the various types of information required by different user groups.
5.2.4 Terminology corpus generation
The terminology corpus is generated according to the specified formats and requirements.

5.3 Organization of terminology corpus
To facilitate terminology research, corpus exchange and the development of terminology corpus systems, the material stored and managed in a terminology corpus should, as far as possible, be classified and organized with a common classification method. Common classification methods include:
a) Chinese Classification for Standards (CCS);
b) International Classification for Standards (ICS);
c) GB/T 13745 Subject classification and code, etc.

6 Establishment and functional design of terminology corpus system
6.1 Establishment of terminology corpus system
The basic process of establishing a terminology corpus system should follow the general principles and methods of system establishment.
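
The duplicate check in 5.2.2 and the chapter-level and term-level annotation in 5.2.3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the markup prescribed by the standard: XML is used here as one widely supported markup language, and the element names (`document`, `header`, `text`, `p`, `term`), the record keys and the function names are all hypothetical.

```python
import hashlib
import re
import xml.etree.ElementTree as ET

# Hypothetical subset of the chapter information listed in 5.2.3.1.
HEADER_FIELDS = ("source", "field", "subfield", "work_type", "region", "date", "title", "author")


def fingerprint(text):
    """Hash of whitespace-normalized text, used to spot duplicate documents."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(documents):
    """Keep the first copy of each document (duplicate check, 5.2.2)."""
    seen = set()
    unique = []
    for doc in documents:
        key = fingerprint(doc["text"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def mark_terms(parent, paragraph, terms):
    """Fill `parent` with the paragraph, wrapping known terms in <term>."""
    if not terms:
        parent.text = paragraph
        return
    # Longest terms first, so a long term wins over a shorter prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    position = 0
    last = None
    for match in pattern.finditer(paragraph):
        between = paragraph[position:match.start()]
        if last is None:
            parent.text = between
        else:
            last.tail = between
        last = ET.SubElement(parent, "term")
        last.text = match.group(0)
        position = match.end()
    rest = paragraph[position:]
    if last is None:
        parent.text = rest
    else:
        last.tail = rest


def annotate(doc, terms):
    """Build a chapter-level record and mark terms in each paragraph."""
    root = ET.Element("document", id=doc["id"])
    header = ET.SubElement(root, "header")
    for name in HEADER_FIELDS:
        if name in doc:
            ET.SubElement(header, name).text = doc[name]
    body = ET.SubElement(root, "text")
    for paragraph in doc["text"].split("\n\n"):
        mark_terms(ET.SubElement(body, "p"), paragraph.strip(), terms)
    return root


if __name__ == "__main__":
    docs = [
        {"id": "d1", "region": "mainland", "text": "A terminology corpus supports term extraction."},
        {"id": "d2", "region": "mainland", "text": "A terminology  corpus supports\nterm extraction."},
    ]
    for doc in drop_duplicates(docs):
        root = annotate(doc, ["terminology corpus", "term extraction"])
        print(ET.tostring(root, encoding="unicode"))
```

Keeping the chapter information in a separate header element means new descriptive fields can be added later without touching the annotated text, which is in the spirit of 5.2.3.3.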
6.2 Functional design of terminology corpus system
According to the needs of terminology research, a terminology corpus system should generally provide functions such as querying usage examples of terms and compiling field frequency statistics.

6.3 Service mode of terminology corpus system
The service mode of the terminology corpus system should be convenient for users and can be selected according to needs during system design, for example: query, online retrieval, access through the Internet, etc.

7 Management and maintenance of terminology corpus system
The management and maintenance of the terminology corpus system should include at least the following:
- corpus management and update;
- update of service mode or function;
- maintenance and management of the terminology corpus system;
- maintenance and management of input and output devices, etc.

Appendix A (Informative Appendix)
Related national standards for establishing terminology corpus
GB/T 3860 Rules for thesaurus indexing of documents
GB/T 10112 Principles and methods of terminology work
GB/T 13190 Rules for compiling Chinese thesaurus
GB/T 13745 Subject classification and code
GB/T 14814 Information processing - Text and office systems - Standard Generalized Markup Language (SGML)
GB/T 15237.1 Terminology work - Vocabulary - Part 1: Theory and application
GB/T 17532 Terminology work - Computer applications - Vocabulary

National Standard of the People's Republic of China
General principles and methods for establishing terminology corpora
GB/T 19101-2003
Published by China Standards Press, No. 16, Sanlihebei Street, Fuxingmenwai, Beijing
Postal code: 100045
Tel: 68523946 68517548
Printed by China Standards Press Huangdao Printing Factory
Distributed by Xinhua Bookstore Beijing Distribution Office
Sold by Xinhua Bookstores in all regions
Format: 880 × 1230 1/16
Printing sheet: 3/4
Word count: 15,000 words
First edition: October 2003
First printing: October 2003
Print run: 1-1000
Book number: 1550661-19896
Website: bzcbs.com
Copyright reserved. Infringements will be investigated.
Report hotline: (010) 68533533
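
As an illustration of the two functions named in 6.2, the sketch below shows a simple usage-example query and a frequency count over an annotation field. It is one possible reading of those functions rather than a design taken from the standard: the function names, the context width and the record layout are assumptions, and "field frequency" is interpreted here as counting how often each value of an annotation field occurs.

```python
import re
from collections import Counter


def find_examples(sentences, term, width=30):
    """Return (left context, term, right context) for each occurrence of `term`."""
    pattern = re.compile(re.escape(term))
    hits = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            left = sentence[max(0, match.start() - width):match.start()]
            right = sentence[match.end():match.end() + width]
            hits.append((left, match.group(0), right))
    return hits


def field_frequencies(records, field):
    """Count how many records carry each value of an annotation field."""
    return Counter(record[field] for record in records if field in record)


if __name__ == "__main__":
    sentences = ["A terminology corpus is a corpus used for studying terminology."]
    for left, term, right in find_examples(sentences, "corpus"):
        print(f"{left!r} [{term}] {right!r}")

    records = [{"region": "mainland"}, {"region": "mainland"}, {"region": "Hong Kong"}, {}]
    print(field_frequencies(records, "region"))
```

A production system would normally query an index rather than scan every sentence; the linear scan is kept here only to make the two functions easy to follow.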