Basic requirements for bilingual parallel corpus processing service
other information
Drafters: Liu Zhiyang, Zhang Jing, Ye Jian, Chai Ying, Huang Baorong, Luo Huifang, Meng Yongye, Zhu Li, Zhang Xuetao, Wang Haitao, Zhu Xianchao, Han Lintao, Zheng Chunping, He Zhongjun, Yu Limei, Zhang Chunliang, Gan Keqin, Zhang Baolin
Drafting unit: China National Institute of Standardization, Translators Association of China, Shanghai Yizhe Information Technology Co., Ltd., Shanghai Youyi Information Technology Co., Ltd., Global Tone Communication Technology Co., Ltd., Beijing Yueer Information
Focal point unit: National Technical Committee on Language and Terminology Standardization (SAC/TC 62)
Proposing unit: National Technical Committee on Language and Terminology Standardization (SAC/TC 62)
Publishing department: State Administration for Market Regulation, National Standardization Administration
Some standard content:
ICS 03.080.99; 35.240.30
CCS A 10
National Standard of the People's Republic of China
GB/T 40035-2021
Basic requirements for bilingual parallel corpus processing service
2021-04-30 Release
2021-11-01 Implementation
State Administration for Market Regulation
National Standardization Administration
1 Scope
2 Normative references
3 Terms and definitions
4 General principles
5 Basic requirements
5.1 Service provider
5.2 Corpus processors
5.3 Service environment
5.4 Processing content
5.5 Processing results
5.5.1 Integrity
5.5.2 Accuracy
5.5.3 Availability
5.5.4 Standardization
5.6 Corpus processing tools
5.6.1 Reliability
5.6.2 Usability
5.6.2.1 Localized interface
5.6.2.2 Operational functions
5.6.2.3 Help system
5.6.2.4 Efficiency
5.6.3 Compatibility
6 Processing flow
6.1 Preprocessing
6.1.1 Corpus preparation
6.1.2 Cleaning
6.1.3 Deduplication
6.1.4 Desensitization
6.2 Corpus alignment
6.3 Corpus review
7 Service content
7.1 Requirement communication
7.2 Customer agreement
7.3 Project management
7.4 Processing phases
7.5 Delivery content
7.6 Quality assurance period
7.7 Service evaluation and improvement
8 Data security
8.1 Data backup
8.2 Document management and log
8.3 Data storage
Appendix A (Informative) Training of bilingual parallel corpus processing personnel
Appendix B (Informative) Metadata of bilingual corpus processing
Appendix C (Informative) Common encoding formats of TXT files
Appendix D (Informative) TMX format specification
Appendix E (Informative) File naming rules, encoding format and file format
References
Foreword
This document is drafted in accordance with the provisions of GB/T 1.1-2020, "Guidelines for Standardization Work, Part 1: Structure and Rules for Standardization Documents".
Please note that some contents of this document may involve patents. The issuing agency of this document does not assume responsibility for identifying patents.
This document is proposed and managed by the National Technical Committee on Language and Terminology Standardization (SAC/TC 62).
Drafting units of this document: China National Institute of Standardization, Translators Association of China, Shanghai Yizhe Information Technology Co., Ltd., Shanghai Youyi Information Technology Co., Ltd., Global Tone Communication Technology Co., Ltd., Beijing Yueer Information Technology Co., Ltd., Suzhou Lianyue Technology Co., Ltd., Sichuan Language Bridge Information Technology Co., Ltd., Beijing Baidu Netcom Technology Co., Ltd., Shenyang Yayi Network Technology Co., Ltd., Shanghai Zhishanhe Network Technology Co., Ltd., Beijing Language and Culture University, Beijing University of Posts and Telecommunications.
The main drafters of this document are: Liu Zhiyang, Zhang Jing, Ye Jian, Chai Ying, Huang Baorong, Luo Huifang, Meng Yongye, Zhu Li, Zhang Xuetao, Wang Haitao, Zhu Xianchao, Han Lintao, Zheng Chunping, He Zhongjun, Yu Limei, Zhang Chunliang, Gan Keqin, Zhang Baolin.
1 Scope
This document specifies the basic requirements, processing procedures, service content and data security of bilingual parallel corpus processing services.
This document is applicable to digital bilingual corpus processing services that take original texts and translations as their objects and text as the form of expression. It can also serve as a reference for the corpus processing of other digital texts, and it is also applicable to the evaluation of corpus alignment tools.
2 Normative references
This document has no normative references.
3 Terms and definitions
The following terms and definitions apply to this document.
3.1
text
Data in the form of characters, symbols, words, phrases, paragraphs, sentences, tables or other arrangements of characters used to express meaning, whose interpretation depends essentially on the reader's knowledge of a natural or artificial language.
[Source: GB/T 4894-2009, 4.1.1.2.1]
3.2
corpus
Linguistic material or data.
3.3
bilingual parallel corpus
Corpus (3.2) consisting of two languages and aligned in parallel at the chapter, paragraph, sentence or other level.
3.4
source language text
Text (3.1) in the source language.
[Source: GB/T 19363.1-2008, 3.4, modified]
3.5
target language text
Text (3.1) in the target language.
[Source: GB/T 19363.1-2008, 3.5, modified]
3.6
client
Individual or organization that accepts products or services provided according to its requirements.
[Source: GB/T 19000-2016, 3.2.4, modified]
3.7
metadata
Descriptive data about the content, quality, status and other characteristics of data.
3.8
service provider
Individual or organization providing services.
3.9
optical character recognition; OCR
Automatic recognition of characters in images obtained by scanners, digital cameras, cameras, etc., for storage, editing and retrieval.
[Source: GB/T 31219.2-2014, 3.4]
3.10
TMX; Translation Memory eXchange
Standard format for translation memory exchange.
3.11
corpus alignment
Aligning bilingual corpora (3.2) at the chapter, paragraph, sentence or other level to form a parallel correspondence.
3.12
corpus alignment tool
A tool used to align bilingual corpora and produce bilingual parallel corpora (3.3).
3.13
correction
Measures taken to eliminate nonconforming content that has been found.
[Source: GB/T 19000-2016, 3.12.3]
3.14
de-identification
The process of removing the connection between data that can identify an individual or organization and the data subject.
[Source: ISO/TS 25237:2008, 3.18]
3.15
sensitive information
Information that may cause potential harm if disclosed or misused.
[Source: GB/T 4894-2009, 4.7.3.2.4, modified]
3.16
anonymized data
Data from which the personal or organizational data directly related to the data subject have been removed.
[Source: GB/T 4894-2009, 4.7.3.2.3, modified]
4 General principles
4.1 Bilingual parallel corpus processing service is a service that establishes a correspondence between the original and translated text content provided by the customer at the paragraph, sentence or other level.
4.2 The purpose of the bilingual parallel corpus processing service is to obtain aligned bilingual text data that provide basic data for computer-assisted translation, machine translation and linguistic research.
4.3 The objects of bilingual parallel corpus processing include the original text, the translation and the metadata of the processed text.
4.4 The bilingual parallel corpus processing service provider (hereinafter referred to as the "service provider") does not review the translation; the quality of the translation is guaranteed by the customer.
4.5 Bilingual parallel corpus processing services can be completed using multiple tools or in an integrated environment. The environment should integrate alignment, metadata collection and other functions to meet the needs of bilingual parallel corpus processing services.
5 Basic requirements
5.1 Service provider
The service provider shall meet the following conditions:
a) establish a complete corpus processing process system, including but not limited to data preprocessing, corpus alignment, project management and quality control;
b) be equipped with qualified corpus processing personnel;
c) be equipped with stable and available corpus alignment tools and related text processing tools;
d) have premises where corpus processing services can be completed.
5.2 Corpus processors
The service provider shall ensure that bilingual parallel corpus processors have the following capabilities:
a) Ability to read the source and target languages: ability to understand the source and target languages, and to quickly read the original text and the translation.
b) Ability to research and process texts: ability to acquire the necessary text processing and professional knowledge, and to develop strategies to make effective use of existing resources.
c) Technical capabilities: ability to use technical resources, including tools and information systems, to support the entire corpus processing process and complete various technical tasks.
Note: See Appendix A for the training of bilingual parallel corpus processors.
5.3 Service environment
The service provider's service environment should have the technical equipment and office equipment required to complete bilingual corpus processing, such as optical character recognition tools and alignment tools. The client may agree with the service provider on the name and version of the tools to be used during processing.
The confidentiality environment and level of the service provider shall comply with the client's requirements for corpus confidentiality. The service provider shall be equipped with confidentiality equipment, perform security hardening and provide confidentiality training for corpus processing personnel according to the client's requirements.
5.4 Processing content
The bilingual corpus shall be provided by the client. The corpus can come from formal publications, internal company materials, websites, etc. Bilingual corpus processing should give priority to digitized bilingual corpora. Bilingual corpora that have not yet been digitized can be converted into digital form by scanning or photographing, or entered directly by keyboard. Bilingual corpora entered by optical character recognition or by keyboard should additionally be proofread to ensure the quality of the content.
5.5 Processing results
5.5.1 Integrity
On the premise of meeting the customer's data processing requirements, the processing results of the service provider should ensure the integrity of the original text, the translation and the metadata, with no loss of information.
Note: The metadata of bilingual corpus processing is shown in Appendix B.
5.5.2 Accuracy
On the premise of meeting the customer's data processing requirements, the processing results of the service provider should ensure the accuracy of the correspondence between the original text and the translation and the accuracy of the metadata.
Note: The metadata of bilingual corpus processing is shown in Appendix B.
5.5.3 Availability
The service provider shall ensure that the processing results meet the following requirements:
a) can be parsed by corpus retrieval, management and production tools;
b) contain no garbled characters, redundant tags or other unusable information;
c) show no format confusion or mismatch between the original text and the translation;
d) contain no redundant information not required by the user.
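The availability requirements above lend themselves to an automated pre-delivery check. The following is a minimal sketch, not part of the standard: the function name `check_pair`, the two regular expressions and the wording of the problem labels are all illustrative assumptions, and a real project would tune them to the customer's specifications.

```python
import re

# Assumption: "redundant tags" are approximated as HTML/XML-like markup.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
# Assumption: "garbled characters" are approximated as U+FFFD (the Unicode
# replacement character) and control characters other than tab, LF and CR.
GARBLED_RE = re.compile(r"[\ufffd\x00-\x08\x0b\x0c\x0e-\x1f]")


def check_pair(src, tgt):
    """Return a list of availability problems found in one aligned pair."""
    problems = []
    if not src.strip() or not tgt.strip():
        problems.append("missing source or target")
    both = src + "\n" + tgt
    if GARBLED_RE.search(both):
        problems.append("garbled characters")
    if TAG_RE.search(both):
        problems.append("leftover markup tags")
    return problems


if __name__ == "__main__":
    print(check_pair("<b>你好</b>", "Hello"))
```

An empty list means the pair passed these particular checks; it does not establish that the pair is correctly aligned.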
5.5.4 Standardization
The processing results of the service provider shall meet the customer's specifications. The data formats of the processing results shall include TMX, TXT, etc., and meet the following requirements:
a) TMX files shall comply with the translation memory exchange specification and retain metadata such as the version number, encoding format, name of the tool that produced the corpus, production time and language codes of the two languages;
b) TXT files shall use a common character set encoding format, such as UTF-8.
Note: The common encoding formats of TXT files can be found in Appendix C, and the TMX format specification can be found in Appendix D.
5.6 Corpus processing tools
Corpus alignment is a key step in bilingual parallel corpus processing. Therefore, as an important part of corpus processing tools, corpus alignment tools should meet the following three requirements: reliability, usability and compatibility.
5.6.1 Reliability
When a local functional failure occurs, corpus alignment tools should still be able to provide alignment functions without affecting the operation of other functions. Corpus alignment tools should provide automatic saving and recovery of alignment process data.
5.6.2 Usability
5.6.2.1 Localized interface
Corpus alignment tools should support Chinese interfaces.
5.6.2.2 Operational functions
The corpus alignment tool should support the operational functions required for aligning bilingual texts:
a) Text editing: in content areas where text input is allowed, support modifying, deleting and adding text;
b) Merge: support merging text distributed over two lines into one line;
c) Split: support splitting a line of text into two lines;
d) Move up: support moving the text position up;
e) Move down: support moving the text position down;
f) Insert: support inserting a line above or below a line of text;
g) Delete: support deleting one or more lines of text;
h) Undo: support reverting to the previous operation, and remaining at the current state when there is no previous step;
i) Alignment: support performing paragraph-level or sentence-level alignment after text adjustment is completed;
j) Export: support exporting the aligned bilingual text after alignment is completed;
k) Save: support saving the text during the alignment process.
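To make the editing operations listed above concrete, the sketch below models one column of an alignment grid with merge, split, insert, delete and undo. It is only an illustration: the class `AlignmentColumn` and its method names are hypothetical, and the snapshot-based undo is one simple design among several, not something the standard prescribes.

```python
class AlignmentColumn:
    """One column (source or target) of an alignment grid, with undo."""

    def __init__(self, rows):
        self.rows = list(rows)
        self._history = []  # snapshots taken before each edit

    def _snapshot(self):
        self._history.append(list(self.rows))

    def merge(self, i, sep=""):
        """Merge row i and row i + 1 into a single row."""
        merged = self.rows[i] + sep + self.rows[i + 1]
        self._snapshot()
        self.rows[i:i + 2] = [merged]

    def split(self, i, pos):
        """Split row i into two rows at character offset pos."""
        row = self.rows[i]
        self._snapshot()
        self.rows[i:i + 1] = [row[:pos], row[pos:]]

    def insert(self, i, below=True):
        """Insert an empty row below (or above) row i."""
        self._snapshot()
        self.rows.insert(i + 1 if below else i, "")

    def delete(self, i, count=1):
        """Delete count rows starting at row i."""
        self._snapshot()
        del self.rows[i:i + count]

    def undo(self):
        """Revert the last edit; do nothing if there is no previous step."""
        if self._history:
            self.rows = self._history.pop()


if __name__ == "__main__":
    col = AlignmentColumn(["ab", "c"])
    col.split(0, 1)
    col.merge(1)
    print(col.rows)
```

Keeping a full snapshot per edit is wasteful for large files; it is used here only because it keeps the undo behaviour easy to verify.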
5.6.2.3 Help system
The corpus alignment tool should provide:
a) offline help documents or online help for system functions, kept consistent with the functions of the tool, so that users can quickly get help when they encounter problems while using the system;
b) basic operation guidance, so that users can quickly learn the operations needed to use the system;
c) friendly interactive prompts that help users find the location of an error and indicate its cause.
5.6.2.4 Efficiency
The efficiency of the corpus alignment tool should be evaluated from the following aspects:
a) Response time:
1) tool startup time;
2) response time for basic operations such as automatic alignment, splitting, merging and saving;
3) recovery time: the time needed to locate the last operation position when the tool is closed and opened again.
b) Convenience:
1) support for shortcut key operation;
2) support for a right-click menu.
5.6.3 Compatibility
The compatibility requirements of corpus alignment tools are as follows:
a) the server-side corpus alignment tool should indicate the supported browsers and avoid using scripts and plug-ins that depend on specific browsers or specific operating systems;
b) the server-side corpus alignment tool should adapt to different browsers and resolutions, and should provide at least one recommended browser and resolution under which the layout and elements of the web page are displayed completely and correctly;
c) the local corpus alignment tool should provide complete installation documentation explaining the supported operating systems, application configuration information and common problem prompts.
6 Processing Flow
6.1 Preprocessing
6.1.1 Corpus preparation
Corpora in image format or scanned form must first be converted into editable electronic text through optical character recognition or direct keyboard entry.
6.1.2 Cleaning
Check and correct garbled characters and special characters in the corpus.
6.1.3 Deduplication
Check the data for duplicates, check the existing bilingual corpus data and metadata, and try to use the customer's existing data to avoid repeated processing.
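The cleaning and deduplication steps can be sketched as follows. This is a minimal illustration under stated assumptions: the functions `clean` and `deduplicate` are hypothetical names, "garbled and special characters" are approximated as control characters plus a few specific code points, and duplicates are detected by exact match after cleaning. Real projects may need fuzzier matching.

```python
import unicodedata

# Assumption: these code points are treated as debris to strip
# (replacement character, byte order mark, zero-width space).
DEBRIS = {"\ufffd", "\ufeff", "\u200b"}


def clean(text):
    """Normalize to NFC, drop control characters and debris, tidy spaces."""
    text = unicodedata.normalize("NFC", text)
    kept = []
    for ch in text:
        if ch in DEBRIS:
            continue
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n":
            continue
        kept.append(ch)
    # Collapse runs of whitespace and trim both ends.
    return " ".join("".join(kept).split())


def deduplicate(pairs, existing=()):
    """Return cleaned pairs not already present in `existing` or earlier."""
    seen = {(clean(s), clean(t)) for s, t in existing}
    fresh = []
    for s, t in pairs:
        key = (clean(s), clean(t))
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh


if __name__ == "__main__":
    new = [("你好\u200b", "Hello"), ("你好", "Hello "), ("再见", "Goodbye")]
    print(deduplicate(new, existing=[("再见", "Goodbye")]))
```

Passing the customer's existing corpus as `existing` reflects the instruction to reuse existing data rather than process it again.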
6.1.4 Desensitization
Desensitize the data according to the customer's desensitization requirements, remove the identity information and other sensitive information in the corpus, and convert the corpus into anonymous data.
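A rule-based pass is one common way to approach desensitization. The sketch below is illustrative only: the function `desensitize`, the placeholder labels and both regular expressions are assumptions, the patterns will miss many real-world formats, and the actual rules must come from the customer's desensitization requirements.

```python
import re

# Assumption: simple patterns for e-mail addresses and phone-like numbers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\d)\+?\d[\d -]{7,}\d(?!\d)")


def desensitize(text, extra_terms=()):
    """Replace e-mail addresses, phone-like numbers and listed terms."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    # extra_terms: customer-supplied names or identifiers to redact verbatim.
    for term in extra_terms:
        text = text.replace(term, "[REDACTED]")
    return text


if __name__ == "__main__":
    sample = "Contact Li Lei at li.lei@example.com or 138 0013 8000."
    print(desensitize(sample, extra_terms=["Li Lei"]))
```

The same replacements should be applied to both the original text and the translation so that the aligned pair stays consistent.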
6.2 Corpus Alignment
The corpus processor imports the bilingual files into the corpus alignment tool. The tool performs automatic segmentation, automatic alignment is combined with manual alignment, and the final bilingual parallel corpus is then exported. When exporting, the source language and target language, corpus name, corpus format and other information should be confirmed.
Note: File naming rules, encoding format and file format are shown in Appendix E.
6.3 Corpus review
The service provider shall conduct a sampling inspection of the processing results. The sample size shall not be less than 10% of the total number of results, and the accuracy of the sampled data shall not be less than 99%.
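The two thresholds in the sampling inspection translate directly into arithmetic. The sketch below is a hedged illustration: `draw_sample` and `passes_review` are hypothetical helper names, simple random sampling is an assumption (the text does not prescribe a sampling method), and integer arithmetic is used so the 10% and 99% boundaries are not affected by floating-point rounding.

```python
import random


def draw_sample(items, rate_percent=10, seed=None):
    """Pick indices covering at least rate_percent of items (rounded up)."""
    size = (len(items) * rate_percent + 99) // 100  # ceiling division
    rng = random.Random(seed)
    return rng.sample(range(len(items)), size)


def passes_review(sample_size, errors, min_accuracy_percent=99):
    """True if the share of correct sampled items meets the threshold."""
    if sample_size <= 0:
        return False
    correct = sample_size - errors
    return correct * 100 >= sample_size * min_accuracy_percent


if __name__ == "__main__":
    results = [f"pair-{n}" for n in range(1000)]
    picked = draw_sample(results, seed=0)
    print(len(picked), passes_review(len(picked), errors=1))
```

For 1000 results this draws 100 items, so at most one misaligned pair in the sample keeps the accuracy at or above 99%.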
The service provider shall check the processing results in accordance with the specifications provided by the customer and with reference to the examples provided by the customer, to ensure that the processing results meet the customer's requirements. The inspection results shall be recorded and archived.
7 Service content
7.1 Requirement communication
The service provider shall establish a sound requirement communication mechanism with the client. Before accepting the client's bilingual parallel corpus processing task, the service provider shall communicate with the client to clarify the processing level of the original text and the translation, the scope of metadata collection, the desensitization requirements and their feasibility, because the efficiency of the bilingual parallel corpus processing service is greatly affected by factors such as the processing level, whether the original text and the translation have been digitized, whether the metadata is easy to collect, and the degree of desensitization. For corpora that have not yet been digitized, the service provider shall reach an agreement with the client on the method of digitizing the corpus (optical character recognition or direct keyboard entry).
According to the client's intended use of the corpus, bilingual parallel corpus processing can be divided into the following two levels:
a) Standard level: perform paragraph or sentence alignment on the original text and the translation, and collect basic metadata.
b) Fine annotation level: according to the client's requirements, in addition to corpus alignment and metadata collection, the corpus is subjected to word segmentation, part-of-speech tagging, syntactic tagging, semantic tagging, etc.
7.2 Client Agreement
The service provider shall reach an agreement with the client and keep a record. If the agreement is reached verbally or by telephone, the service provider shall confirm the agreement and its terms in writing (such as by letter, fax or email). The client and the service provider shall agree on the level of corpus processing (paragraph level, sentence level, etc.). If the sentence is the basic unit, the two parties shall agree on the sentence segmentation rules, the rules for handling cases where the meaning of the original text and the translation do not correspond, and so on. The client shall send the relevant specifications (such as segmentation rules and intended use) together with examples to the service provider, and the service provider shall follow them. The client and the service provider may negotiate and agree on the ownership of the intellectual property rights of the corpus and the requirements for data confidentiality. If any deviation from the agreement arises during its implementation, the parties shall reach a consensus, revise the agreement, and record and archive it.
7.3 Project Management
The service provider shall assign a project manager to handle task allocation, progress management, quality inspection and similar tasks for the corpus processing project.
7.4 Processing phases
The bilingual parallel corpus processing phases include bilingual corpus preprocessing, bilingual corpus alignment and bilingual parallel corpus review.
7.5 Delivery content
The delivery content of bilingual parallel corpus processing shall include the bilingual parallel corpus and a processing report. The delivery requirements are as follows:
a) The bilingual parallel corpus shall be delivered via removable storage media or cloud storage, and shall include metadata such as the name of the processing service provider, delivery period and total number of corpus items.
b) The processing report shall be delivered via removable storage media or cloud storage, and shall include information such as an overview of the corpus provided by the customer, a description of the processing process, number of days for actual processing and delivery of corpus items, a description of failed corpus, a description of the accuracy of the delivered corpus, the tools used for processing and the time taken to complete the service.
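As an illustration of how a delivered corpus might carry the TMX header metadata required in 5.5.4 together with the delivery metadata listed above, the sketch below writes a small TMX document with Python's standard `xml.etree.ElementTree`. The function `build_tmx`, the tool name `example-aligner` and the `x-provider` and `x-item-count` property types are hypothetical; the header attribute names follow the TMX 1.4 header, but the result should still be validated against the format described in Appendix D.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialized as xml:lang


def build_tmx(pairs, src_lang, tgt_lang, provider,
              tool="example-aligner", tool_version="0.1", created=None):
    """Return a TMX document (as a string) for aligned (source, target) pairs."""
    created = created or datetime.now(timezone.utc)
    tmx = ET.Element("tmx", version="1.4")
    header = ET.SubElement(tmx, "header", {
        "creationtool": tool,
        "creationtoolversion": tool_version,
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
        "o-encoding": "UTF-8",
        "creationdate": created.strftime("%Y%m%dT%H%M%SZ"),
    })
    # Hypothetical custom properties carrying delivery metadata.
    for name, value in (("x-provider", provider),
                        ("x-item-count", str(len(pairs)))):
        ET.SubElement(header, "prop", type=name).text = value
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    doc = build_tmx([("你好。", "Hello."), ("再见。", "Goodbye.")],
                    "zh-CN", "en-US", provider="Example Provider")
    with open("delivery.tmx", "w", encoding="utf-8") as fh:
        fh.write('<?xml version="1.0" encoding="UTF-8"?>\n' + doc)
```

Writing the file explicitly as UTF-8 keeps the declared encoding and the actual bytes in agreement.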
7.6 Quality Assurance Period
The service provider shall agree with the customer on the quality assurance period. If none is agreed, the minimum quality assurance period shall be one year. During the quality assurance period, the service provider shall fix the corpus processing problems raised by the customer.
7.7 Service evaluation and improvement
The service provider shall designate a dedicated person to track customer feedback, record and organize it, take corresponding improvement measures, and optimize the corpus processing process.
For bilingual corpora delivered in batches, the service provider shall arrange dedicated customer service personnel to track quality after each batch of processing results is delivered, ask for customer feedback and take corresponding improvement measures.
8 Data security
8.1 Data backup
In each phase of bilingual parallel corpus processing, the security and order of the bilingual corpus shall be ensured, and multiple data backups shall be made in a timely manner.
8.2 Document management and log
Operation logs shall be recorded throughout bilingual corpus processing, and the technical and management documents of the processing process shall be written and summarized in a timely manner.
8.3 Data storage
The service provider shall store the customer's requirement documents, processing documents and final delivery documents by customer and period, to facilitate corpus information query and customer tracking.
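One way to store documents "by customer and period" is a fixed directory layout. The sketch below is an assumption-laden illustration: the helper `storage_path`, the three folder names and the period label format are hypothetical choices, not requirements of this document.

```python
from pathlib import PurePosixPath

# Assumption: one folder per document type named in the storage requirement.
DOCUMENT_KINDS = ("requirements", "processing", "delivery")


def storage_path(root, customer, period, kind, filename):
    """Build root/customer/period/kind/filename for an archived document."""
    if kind not in DOCUMENT_KINDS:
        raise ValueError(f"unknown document kind: {kind!r}")
    return PurePosixPath(root) / customer / period / kind / filename


if __name__ == "__main__":
    print(storage_path("/archive", "customer-a", "2021-Q4",
                       "delivery", "corpus.tmx"))
```

Grouping by customer first and period second keeps everything for one customer together, which suits customer tracking; the opposite order would favour period-based reporting.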
Appendix A
(Informative)
Training of bilingual parallel corpus processing personnel
A.1 Training bilingual parallel corpus processors in the knowledge and skills required for corpus processing can:
a) provide bilingual parallel corpus processors with the skills required for corpus processing;
b) help them meet the growing demand for corpus processing and improve efficiency;
c) promote the development and innovation of bilingual parallel corpus processing technology.
A.2 Training of bilingual parallel corpus processors can include:
a) advanced semantic processing techniques and the use of scripts to process bilingual texts;
b) desensitization and corpus cleaning techniques (including removing garbled characters, formatting tags, etc. from the corpus), so as to better handle bilingual corpus alignment scenarios;
c) the use of quality tools to perform quality checks at the end of the project, such as checking the validity of the format.