Information and documentation—WARC file format

Basic Information

Standard ID: GB/T 33994-2017

Standard Name:Information and documentation—WARC file format

Chinese Name: 信息和文献 WARC文件格式

Standard category: National Standard (GB)

Status: In force

Date of Release: 2017-07-12

Date of Implementation: 2018-02-01

Standard classification number

Standard ICS number: Information technology, office machinery and equipment>>Information technology applications>>35.240.30 Information technology in information, documentation and publication

Standard Classification Number: General>>Economy, Culture>>A14 Library, Archives, Documentation and Information Work

Associated standards

Adoption status: ISO 28500:2009

Publication information

Publishing house: China Standards Press

Publication date: 2017-07-20

Other information

Drafter: Mao Yajun, Li Chunming, Wu Zhenxin, Zhen Qin, Qu Yunpeng, Zhang Xiaodan, Zhang Lan, Yang He, Dun Wenjie, Zhang Biao

Drafting unit: National Library of China, Documentation and Information Center of Chinese Academy of Sciences, National Defense Science and Technology Information Center of China, China Institute of Scientific and Technical Information, Beijing Wanfang Data Co., Lt

Focal point unit: National Technical Committee for Information and Documentation Standardization (SAC/TC 4)

Proposing unit: National Technical Committee for Information and Documentation Standardization (SAC/TC 4)

Publishing department: General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China; Standardization Administration of China

Introduction to standards:

GB/T 33994-2017 Information and documentation WARC file format
This standard specifies the WARC file format, which is intended to:
-- store payload content and control information from mainstream Internet application layer protocols (such as HTTP, DNS and FTP);
-- store arbitrary metadata linked to other stored data (such as subject classification, language, encoding);
-- support data compression and maintain the integrity of data records;
-- store all control information from the harvesting protocol (such as request headers), not just response information;
-- store the results of data transformations linked to other stored data;
-- store duplicate detection events linked to other stored data (to reduce storage when identical or substantially similar resources occur);
-- be extended without disrupting existing functionality;
-- support truncation or segmentation of overlong records where required.


Some standard content:

ICS35.240.30
National Standard of the People's Republic of China
GB/T 33994-2017/ISO 28500:2009
Information and documentation
WARC file format
(ISO 28500:2009, IDT)
Published on 2017-07-12
Implementation on 2018-02-01
General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China; Standardization Administration of the People's Republic of China
This standard was drafted in accordance with the rules given in GB/T 1.1-2009.
This standard is identical to ISO 28500:2009 "Information and documentation - WARC file format" and was adopted by translation.
The Chinese document that corresponds to the international document normatively referenced in this standard is as follows:
-- GB/T 7408-2005 Data elements and interchange formats - Information interchange - Representation of dates and times (ISO 8601:2000, IDT).
This standard has made the following editorial changes:
-- added the abbreviations LWS, MIME and US-ASCII (see 3.2);
-- to improve readability, some examples have been replaced with domestic examples while the examples in the international standard are retained (see Appendix B).
This standard was proposed by and is under the jurisdiction of the National Technical Committee for Information and Documentation Standardization (SAC/TC 4).
The drafting units of this standard are: National Library, Documentation and Information Center of the Chinese Academy of Sciences, China National Defense Science and Technology Information Center, China Institute of Scientific and Technical Information, and Beijing Wanfang Data Co., Ltd.
The main drafters of this standard are Mao Yajun, Li Chunming, Wu Zhenxin, Zhenqi, Qu Yunpeng, Zhang Xiaodan, Zhang Lan, Yang He, Dun Wenjie, and Zhang Biao.
Websites and web pages appear on and disappear from the Internet every day. For more than a decade, memory storage organizations have tried to find the most suitable way to collect and keep track of this massive amount of important information using web-scale tools (such as web crawlers). At the same time, these organizations have an increasing need to preserve digital resources that are not captured from the Internet (such as complete sets of electronic journals or data generated by environmental sensing devices). This calls for a container format that allows a single file to carry, simply and safely, a very large number of constituent data objects for storage, management, and exchange.
The WARC (Web ARChive) file format provides a convention for concatenating multiple resource records (data objects) into one long file, where each resource record consists of a set of simple text headers and an arbitrary data content block. The WARC format is an extension of the ARC file format. It is intended to serve as a standard for organizing, managing, and storing billions of digital resources collected from the Internet and elsewhere, and can be used to build applications for harvesting (such as the open source Heritrix web crawler), managing, accessing, and exchanging content. In addition to the primary content recorded with ARC, the extended WARC format also accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, later-date transformations, and segmentation of large resources.
1 Scope
This standard specifies the WARC file format, which is intended to:
-- store payload content and control information from mainstream Internet application layer protocols (such as HTTP, DNS and FTP);
-- store arbitrary metadata linked to other stored data (such as subject classification, language, encoding);
-- support data compression and maintain the integrity of data records;
-- store all control information from the harvesting protocol (such as request headers), not just response information;
-- store the results of data transformations linked to other stored data;
-- store duplicate detection events linked to other stored data (to reduce storage when identical or substantially similar resources occur);
-- be extended without disrupting existing functionality;
-- support truncation or segmentation of overlong records where required.
2 Normative references
The following documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition (including all amendments) applies.
ISO 8601 Data elements and interchange formats - Information interchange - Representation of dates and times
RFC 1035 Domain names - Implementation and specification
RFC 1884 IPv6 address architecture
RFC 2045 Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies
RFC 2540 Detached Domain Name System (DNS) Information
RFC 2616 Hypertext Transfer Protocol - HTTP/1.1
RFC 2822 Internet Message Format
RFC 3629 UTF-8, a transformation format of ISO 10646
RFC 3986 Uniform Resource Identifier (URI): Generic Syntax
RFC 4027 Domain Name System Media Types
W3C DTF Date and Time Formats: note submitted to the W3C
3 Terms, Definitions and Abbreviations
3.1 Terms and Definitions
The following terms and definitions apply to this document.
3.1.1
WARC record
The basic component of a WARC file. A WARC file consists of a sequence of WARC records.
3.1.2
WARC record content block
The part of a WARC record (zero or more octets) that follows the header and forms the main body of the record.
3.1.3
WARC record payload
The data object referred to or contained by a WARC record, which is a meaningful subset of the content block.
3.1.4
WARC record header
The beginning of a WARC record. The first line declares that the record is in the WARC format of a given version number; it is followed by lines of named fields and terminated by a blank line.
3.1.5
WARC named fields
A set of elements consisting of a name, a colon, and a value. Long values are continued on indented lines.
3.1.6
WARC logical record
In the case of segmentation, a WARC logical record consists of multiple segments, each of which is represented by a WARC record.
3.2 Abbreviations
The following abbreviations apply to this document.
ABNF: Augmented Backus-Naur Form
ARC: Archive
CRLF: Carriage Return Line Feed
DNS: Domain Name System
FTP: File Transfer Protocol
HTTP: Hypertext Transfer Protocol
IANA: Internet Assigned Numbers Authority
IESG: Internet Engineering Steering Group
LWS: Linear White Space
MIME: Multipurpose Internet Mail Extensions
RFC: Request for Comments
UR(I/L/N): Uniform Resource (Identifier/Locator/Name)
US-ASCII: American Standard Code for Information Interchange
WARC: Web ARChive
4 File and Record Model
A WARC format file consists of one or more WARC records concatenated in sequence. The first record usually describes the records that follow. Typically, the content of a record is either the direct result of a retrieval attempt (a web page, embedded image, URL redirection information, DNS host name lookup result, standalone file, etc.) or synthesized material that provides additional information about the archived content (such as metadata or transformed content). Each WARC record consists of a record header, a record content block, and two CRLFs. The first line of the WARC record header declares the version number of the WARC format used by the record; it is followed by a variable number of named fields terminated by a blank line. The format of the WARC record header should follow the general rules of HTTP/1.1 [RFC2616] and [RFC2822] headers, with the exception that UTF-8 characters as specified in [RFC3629] may be used.
The top-level view of a WARC file can be expressed in ABNF (augmented Backus-Naur form), reusing the augmented constructs defined in 2.1 of HTTP/1.1 [RFC2616]. (Note in particular that, to avoid confusion, where a WARC rule has the same name as an [RFC2616] rule, the definition is also the same, except for the CHAR rule, which in WARC includes multi-byte UTF-8 characters.)
    warc-file   = 1*warc-record         ; a WARC file includes at least one WARC record
    warc-record = header CRLF
                  block CRLF CRLF       ; a WARC record includes a header, a CRLF, a content block and two CRLFs
    header      = version warc-fields   ; the header includes the version and the WARC fields
    version     = "WARC/1.0" CRLF       ; the version is "WARC/1.0" followed by a CRLF
    warc-fields = *named-field CRLF     ; the WARC fields include zero or more named fields and a CRLF
    block       = *OCTET                ; the content block includes zero or more octets
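The record layout given by the grammar above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `build_record` is hypothetical, each named field is written on its own CRLF-terminated line (the usual reading of the `warc-fields` rule), and Content-Length is derived from the block rather than supplied by the caller.

```python
CRLF = b"\r\n"


def build_record(fields, block):
    """Serialize one WARC record (hypothetical helper, not part of the standard).

    fields: list of (name, value) string pairs, excluding Content-Length.
    block:  the content block as bytes (may be empty).
    """
    lines = [b"WARC/1.0"]  # version line
    for name, value in fields:
        lines.append(f"{name}: {value}".encode("utf-8"))
    # Content-Length counts the octets of the content block only.
    lines.append(f"Content-Length: {len(block)}".encode("ascii"))
    header = CRLF.join(lines) + CRLF  # every header line ends with CRLF
    # header, blank line, content block, then two CRLFs closing the record
    return header + CRLF + block + CRLF + CRLF


if __name__ == "__main__":
    record = build_record([("WARC-Type", "resource")], b"hello")
    print(record.decode("utf-8"))
```

Because a WARC file is a plain sequence of such records, a file can be produced by concatenating the returned byte strings.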
The record version shall appear at the beginning of each record, and therefore also at the beginning of the WARC file. WARC records rely heavily on named fields. Each named field consists of a name, a colon (:), and a field value. Field names are case-insensitive. The field value may be preceded by any amount of linear white space, but a single space is recommended. Header fields can extend over multiple lines, provided each additional line begins with at least one space or tab character.
Named fields can appear in any order, and their field values can contain any UTF-8 characters. Both defined fields and extension fields follow the general named field format. Extension fields can be used to extend the core format.
    named-field   = field-name ":" [ field-value ]   ; a named field is a field name, ":" and an optional field value
    field-name    = token                            ; the field name is a token
    field-value   = *( field-content | LWS )         ; the field value is zero or more (field content | linear white space)
    field-content = <the OCTETs making up the field value, consisting of either *TEXT or combinations of token, separators and quoted-string>
    OCTET         = <any 8-bit sequence of data>
    token         = 1*<any US-ASCII character except CTLs or separators>   ; at least one such character
    separators    = "(" | ")" | "<" | ">" | "@"
                  | "," | ";" | ":" | "\" | <">
                  | "/" | "[" | "]" | "?" | "="
                  | "{" | "}" | SP | HT
    TEXT          = <any OCTET except CTLs, but including LWS>
    CHAR          = <UTF-8 characters; RFC3629>      ; (0-191, 194-244)
    DIGIT         = <any US-ASCII digit "0".."9">
    CTL           = <any US-ASCII control character (octets 0-31) and DEL (127)>
    CR            = <ASCII CR, carriage return>
    LF            = <ASCII LF, line feed>
    SP            = <ASCII SP, space>
    HT            = <ASCII HT, horizontal tab>
    CRLF          = CR LF                            ; carriage return and line feed
    LWS           = [CRLF] 1*( SP | HT )             ; semantically the same as a single space
    quoted-string = ( <"> *( qdtext | quoted-pair ) <"> )
    qdtext        = <any TEXT except <">>
    quoted-pair   = "\" CHAR                         ; single-character quoting
    uri           = "<" <'URI' per RFC3986> ">"
Although UTF-8 characters may be used when writing WARC fields, the 'encoded-word' mechanism of [RFC2047] may also be used, and WARC reading software needs to be able to interpret it.
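The named field rules described above (case-insensitive names, optional white space before the value, continuation lines that begin with a space or tab) can be sketched in Python as follows. The function name `parse_named_fields` is hypothetical, the input is assumed to be already decoded as UTF-8 text, and 'encoded-word' decoding is deliberately left out.

```python
def parse_named_fields(header_text):
    """Parse named-field lines into a list of (name, value) pairs.

    Hypothetical helper: names are lower-cased because field names are
    case-insensitive; a continuation line (leading SP or HT) is joined to
    the previous value with one space, since LWS means a single space.
    """
    fields = []
    for line in header_text.split("\r\n"):
        if not line:
            continue  # blank line that terminates the header
        if line[0] in " \t":
            if not fields:
                raise ValueError("continuation line without a preceding field")
            name, value = fields[-1]
            fields[-1] = (name, (value + " " + line.strip()).strip())
        else:
            name, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"not a named field: {line!r}")
            fields.append((name.strip().lower(), value.strip()))
    return fields


if __name__ == "__main__":
    text = "WARC-Type: response\r\nX-Note: first part\r\n\tsecond part\r\n"
    print(parse_named_fields(text))
```

A list of pairs is returned instead of a dictionary so that repeatable fields keep every occurrence.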
The rest of the WARC record syntax involves parameters for defined fields, such as record identifier, record type, creation time, content length, and content type.
    defined-field = WARC-Type
                  | WARC-Record-ID
                  | WARC-Date
                  | Content-Length
                  | Content-Type
                  | WARC-Concurrent-To
                  | WARC-Block-Digest
                  | WARC-Payload-Digest
                  | WARC-IP-Address
                  | WARC-Refers-To
                  | WARC-Target-URI
                  | WARC-Truncated
                  | WARC-Warcinfo-ID
                  | WARC-Filename                  ; only applies to warcinfo record type
                  | WARC-Profile                   ; only applies to revisit record type
                  | WARC-Identified-Payload-Type
                  | WARC-Segment-Origin-ID         ; only applies to continuation record type
                  | WARC-Segment-Number
                  | WARC-Segment-Total-Length      ; only applies to continuation record type
Each WARC record shall have a record type, given in the WARC-Type field. The following eight WARC record types are defined in this standard:
-- "warcinfo";
-- "response";
-- "resource";
-- "request";
-- "metadata";
-- "revisit";
-- "conversion";
-- "continuation".
Other WARC record types may be defined in extensions of the core format. The fields relevant to each record type are detailed in Clause 6, and the meaning and legal value format of each field are detailed in Clause 5. The content block of a record shall contain octets that are interpreted according to the record type and other header values. All records shall contain a Content-Length field to specify the length of the content block.
Some record types (and possibly future record types) define a payload, which may be a meaningful subset of the content block or the content of a previous record. Some headers relate to the payload of the record rather than directly to its content block. For example, in a "response" record whose content block consists of HTTP headers and a data object, the payload is the data object. All "response", "resource", "request", "conversion" and "continuation" records can have a payload. All "warcinfo", "metadata" and "revisit" records should not have a payload. Content that conforms to the warc-file rule should have the MIME content-type "application/warc", as described in 8.2. Content that conforms only to the warc-fields rule is useful as a simple description format and has the MIME content-type "application/warc-fields", as described in 8.3.
5 Named Fields
5.1 Overview
The named fields in a WARC record provide information about the current record. WARC both reuses appropriate headers from other standards and defines new headers for WARC-specific purposes, all of which begin with "WARC-". A WARC named field of the same type MUST NOT be repeated in the same WARC record (e.g., a WARC record MUST NOT contain more than one "WARC-Date" or "WARC-Target-URI" field), unless otherwise noted (e.g., WARC-Concurrent-To). Because new fields MAY be defined in extensions to the WARC core format, WARC processing software SHOULD ignore fields whose names it does not recognize.
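A reader could check the no-repetition rule with a few lines of Python. This is a minimal sketch: the helper name `repeated_fields` and the `REPEATABLE` set are assumptions made for illustration, and only WARC-Concurrent-To is treated as repeatable because it is the only exception named in this overview.

```python
from collections import Counter

# Fields that may legitimately occur more than once in a record.
# Only WARC-Concurrent-To is listed here; this set is an assumption.
REPEATABLE = {"warc-concurrent-to"}


def repeated_fields(fields):
    """Return the sorted names of fields that are repeated but must not be.

    fields: iterable of (name, value) pairs; names compare case-insensitively.
    Unrecognized field names are counted like any other and are otherwise
    left alone, in line with the advice to ignore unknown fields.
    """
    counts = Counter(name.lower() for name, _ in fields)
    return sorted(n for n, c in counts.items() if c > 1 and n not in REPEATABLE)


if __name__ == "__main__":
    sample = [
        ("WARC-Date", "2017-07-12T08:30:00Z"),
        ("warc-date", "2017-07-12T08:30:01Z"),
        ("WARC-Concurrent-To", "<urn:uuid:a>"),
        ("WARC-Concurrent-To", "<urn:uuid:b>"),
    ]
    print(repeated_fields(sample))
```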
5.2 WARC-Record-ID (Required)
The WARC-Record-ID is an identifier assigned to the current record that is globally unique for its period of intended use. This standard does not mandate a particular identifier scheme. Each WARC-Record-ID shall be a valid URI that clearly indicates the documented and registered scheme it follows (e.g., through a URI scheme prefix such as "http:" or "urn:"). Note that the value shall not contain white space.
    WARC-Record-ID = "WARC-Record-ID" ":" uri
All records shall have a WARC-Record-ID field.
5.3 Content-Length (Required)
Content-Length is the number of octets in the content block, similar to the definition in [RFC2616]. If there is no content block, the value shall be "0".
    Content-Length = "Content-Length" ":" 1*DIGIT
All records shall have a Content-Length field.
5.4 WARC-Date (Required)
WARC-Date is a 14-digit timestamp in the format YYYY-MM-DDThh:mm:ssZ, following ISO 8601. The timestamp shall represent the instant at which data capture for the record began. When multiple records are written as part of a single capture event (see 5.7), they shall use the same WARC-Date, even though the records are not strictly written at the same time.
    WARC-Date     = "WARC-Date" ":" GB/T7408-2005
    GB/T7408-2005 = <YYYY-MM-DDThh:mm:ssZ>
All records shall have a WARC-Date field. For examples of using the WARC-Date field, see Appendix A.
5.5 WARC-Type (Required)
WARC-Type is the type of the WARC record. The record types defined in this standard are:
-- "warcinfo";
-- "response";
-- "resource";
-- "request";
-- "metadata";
-- "revisit";
-- "conversion";
-- "continuation".
Other types of WARC records may be defined in extensions to the core format. The record types are described in detail in Clause 6. A WARC file need not contain any particular record type, but it is recommended that every WARC file begin with a "warcinfo" record.
    WARC-Type   = "WARC-Type" ":" record-type
    record-type = "warcinfo" | "response" | "resource" | "request" | "metadata" | "revisit" | "conversion" | "continuation" | future-type
    future-type = token
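The four required fields described in 5.2 to 5.5 can be generated together, as in the Python sketch below. It rests on assumptions that the standard does not mandate: the function name `mandatory_fields` is hypothetical, and a `urn:uuid:` URI is used as the record identifier simply because it is a registered scheme that contains no white space.

```python
import uuid
from datetime import datetime, timezone


def mandatory_fields(record_type, block, when=None):
    """Build the four required named fields for one record (illustrative only).

    record_type: a WARC-Type token such as "response".
    block:       the content block as bytes.
    when:        timezone-aware capture time; defaults to the current time.
    """
    when = when or datetime.now(timezone.utc)
    return [
        ("WARC-Type", record_type),
        # uri = "<" URI ">"; the urn:uuid scheme is an assumption of this sketch
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        # 14-digit UTC timestamp in the form YYYY-MM-DDThh:mm:ssZ
        ("WARC-Date", when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        # number of octets in the content block, "0" when it is empty
        ("Content-Length", str(len(block))),
    ]


if __name__ == "__main__":
    for name, value in mandatory_fields("response", b"abc"):
        print(f"{name}: {value}")
```

Records that belong to one capture event would pass the same `when` value so that they share a WARC-Date.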
All records shall have a WARC-Type field. WARC processing software should ignore records of an unrecognized type. Examples of the use of the WARC-Type field are given in Appendix A.
5.6 Content-Type
The Content-Type field is the MIME type (as defined in [RFC2045]) of the information in the content block of the record. For example, in HTTP request and response records, the value is "application/http" as defined in 19.1 of [RFC2616] (or "application/http;msgtype=request" and "application/http;msgtype=response", respectively). In particular, the Content-Type is not the value of the HTTP Content-Type header in an HTTP response, but a MIME type that describes the complete archived HTTP message (hence "application/http" if the content block contains request or response headers).
    Content-Type = "Content-Type" ":" media-type
    media-type   = type "/" subtype *( ";" parameter )
    type         = token
    subtype      = token
    parameter    = attribute "=" value
    attribute    = token
    value        = token | quoted-string
With the exception of "continuation" records, all records whose content block is not empty (Content-Length is not zero) should have a Content-Type field. Only if no Content-Type field gives the media type may the reader attempt to guess the media type from the content and/or the URI extension of the resource. If the media type remains unknown, the reader should treat it as "application/octet-stream".
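The fallback order just described (declared type first, then a guess, then "application/octet-stream") might be implemented as in this Python sketch. The function name `effective_media_type` is hypothetical, and guessing is limited to the URI extension through the standard `mimetypes` module; inspecting the content itself is not attempted.

```python
import mimetypes


def effective_media_type(content_type, target_uri=None):
    """Return the media type a reader would use for a content block.

    content_type: value of the Content-Type field, or None if absent.
    target_uri:   URI of the resource, used only for extension guessing.
    """
    if content_type:
        return content_type  # a declared type always wins
    if target_uri:
        guessed, _encoding = mimetypes.guess_type(target_uri)
        if guessed:
            return guessed
    return "application/octet-stream"  # media type is still unknown


if __name__ == "__main__":
    print(effective_media_type(None, "http://example.com/logo.png"))
    print(effective_media_type(None, "http://example.com/blob"))
```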
5.7 WARC-Concurrent-To
One or more WARC-Concurrent-To fields contain the WARC-Record-ID of one or more records that belong to the same capture event as the current record. A capture event comprises all the information automatically collected when retrieving a single WARC-Target-URI. For example, it may be represented by a "response" or "revisit" record and its associated "request" record.
    WARC-Concurrent-To = "WARC-Concurrent-To" ":" uri
Records of type "request", "response", "resource", "metadata", and "revisit" generated by the same capture event can be associated with one another through the WARC-Concurrent-To field (when used in this way, the WARC-Concurrent-To association is bidirectional even if the header appears in only one of the records). The WARC-Concurrent-To field does not apply to "warcinfo", "conversion", and "continuation" records.
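Because the association is bidirectional even when only one record carries the header, a reader has to add the reverse link itself. The Python sketch below shows one possible way to do so; the function name `concurrent_links` and the input shape are assumptions, and no transitive closure is computed.

```python
def concurrent_links(records):
    """Build a symmetric map of WARC-Concurrent-To associations.

    records: iterable of (record_id, concurrent_to_ids) pairs, where
             concurrent_to_ids lists the WARC-Concurrent-To values found
             in that record's header (possibly empty).
    Returns a dict mapping each record id to the set of ids linked to it.
    """
    links = {}
    for record_id, concurrent_to in records:
        links.setdefault(record_id, set())
        for other in concurrent_to:
            links[record_id].add(other)
            # reverse link: the association holds in both directions
            links.setdefault(other, set()).add(record_id)
    return links


if __name__ == "__main__":
    sample = [
        ("<urn:uuid:req>", ["<urn:uuid:resp>"]),
        ("<urn:uuid:resp>", []),
    ]
    print(concurrent_links(sample))
```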
As a special case, the WARC-Concurrent-To field can be repeated in the same WARC record. For examples of the use of the WARC-Concurrent-To field, see Appendix A.
5.8 WARC-Block-Digest
WARC-Block-Digest is an optional parameter that indicates the algorithm name and calculated value of a digest applied to the full content block of the record.
    WARC-Block-Digest = "WARC-Block-Digest" ":" labelled-digest
    labelled-digest   = algorithm ":" digest-value
    algorithm         = token
    digest-value      = token
The following is an example of a SHA-1 digest represented as a Base32 ([RFC3548]) value:
    WARC-Block-Digest: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ
This standard does not recommend any particular algorithm.
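A labelled digest of the kind shown in the example can be computed with the Python standard library. This is a sketch only: the function name `labelled_digest` is hypothetical, and SHA-1 with Base32 is chosen to mirror the example, not because the standard recommends it.

```python
import base64
import hashlib


def labelled_digest(data, algorithm="sha1"):
    """Return "algorithm:digest-value" for the given octets.

    The digest value is the Base32 encoding of the raw digest, matching
    the form of the SHA-1 example; other encodings are equally possible.
    """
    raw = hashlib.new(algorithm, data).digest()
    return f"{algorithm}:{base64.b32encode(raw).decode('ascii')}"


if __name__ == "__main__":
    block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
    print("WARC-Block-Digest: " + labelled_digest(block))
```

For the block digest the input is the whole content block, exactly as many octets as Content-Length states.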
Any record MAY have a WARC-Block-Digest field.
5.9 WARC-Payload-Digest
WARC-Payload-Digest is an optional parameter that indicates the algorithm name and calculated value of a digest applied to the payload referred to or contained by this record. This value is not necessarily equal to the value of WARC-Block-Digest.
    WARC-Payload-Digest = "WARC-Payload-Digest" ":" labelled-digest
The following is an example of a SHA-1 digest represented as a Base32 ([RFC3548]) value:
    WARC-Payload-Digest: sha1:3EF4GH5IJ6KL7MN8OPQAB2CD
This standard does not recommend any particular algorithm.
The payload of an "application/http" content block is its "entity-body" (see [RFC2616]). Unlike WARC-Block-Digest, the WARC-Payload-Digest field can also be used for data that does not appear in the content block of the current record, for example when a content block is left out according to a "revisit" profile (see 6.7) or when a record is segmented (the WARC-Payload-Digest recorded in the first segment of a segmented record should be the payload digest of the logical record). The WARC-Payload-Digest field can be used for WARC records with a well-defined payload, but not for records without a well-defined payload.
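The difference between block digest and payload digest for an "application/http" content block can be made concrete in Python. The sketch assumes the entity-body is everything after the first blank line of the HTTP message and ignores transfer codings such as chunked encoding; the function names `http_payload` and `sha1_base32` are hypothetical.

```python
import base64
import hashlib


def http_payload(block):
    """Return the entity-body of an archived HTTP message (simplified).

    Assumption: the body starts right after the first CRLF CRLF, and no
    transfer coding needs to be undone.
    """
    _head, sep, body = block.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line separating HTTP headers from body")
    return body


def sha1_base32(data):
    """Base32-encoded SHA-1 digest of the given octets."""
    return base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


if __name__ == "__main__":
    block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
    print("WARC-Block-Digest: sha1:" + sha1_base32(block))
    print("WARC-Payload-Digest: sha1:" + sha1_base32(http_payload(block)))
```

The two printed values differ because the block digest also covers the HTTP status line and headers.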
5.10 WARC-IP-Address
WARC-IP-Address is the numeric Internet address contacted to retrieve the included content. IPv4 addresses are written as dotted quads. IPv6 addresses should be written as defined in [RFC1884]. For an HTTP retrieval, this is the IP address that corresponded to the host in the target-URI at retrieval time.
    WARC-IP-Address = "WARC-IP-Address" ":" ( ipv4 | ipv6 )
    ipv4            = <"dotted quad">
    ipv6            = <as defined in [RFC1884]>
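A writer might validate the address before emitting the field, for instance with Python's `ipaddress` module as sketched below. The function name `warc_ip_address_field` is hypothetical, and writing IPv6 addresses in the compressed "::" text form is a choice made by this sketch.

```python
import ipaddress


def warc_ip_address_field(address):
    """Format a WARC-IP-Address header line for a numeric address.

    Raises ValueError when the input is not a numeric IPv4 or IPv6
    address (for example, when a host name is passed by mistake).
    """
    ip = ipaddress.ip_address(address)
    # str() yields a dotted quad for IPv4 and the compressed form for IPv6
    return f"WARC-IP-Address: {ip}"


if __name__ == "__main__":
    print(warc_ip_address_field("192.0.2.1"))
    print(warc_ip_address_field("2001:db8:0:0:0:0:0:1"))
```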
=\media-typemedia-type
subtype
parameter
attribute
=type\/\subtype*(\\parameter)=token
=token
-attribute\-\ value
=tokenwwW.bzxz.Net
=tokenI quoted-string
Except for "continuation\ records, all records whose content blocks are not empty (Content-Length value is not zero) should have a Content-Type field. This field is used only when the Content-Type field does not provide a media type. When the resource is identified, the reader may guess its media type based on the resource's content and/or URI extension. If the media type is still not recognized, the reader should treat it as "application/octet-stream". || tt||5.7WARC-Concurrent-To
One or more WARC-Concurrent-To fields contain the WARC-Record-ID of one or more records that belong to the same crawl event as the current record. A crawl event includes all the information automatically collected when retrieving a particular WARC-Target-URI; for example, it may comprise a "response" or "revisit" record and the related "request" record.
WARC-Concurrent-To = "WARC-Concurrent-To" ":" uri
The "request", "response", "resource", "metadata" and "revisit" records generated by the same crawl event can be linked via the WARC-Concurrent-To field (when used this way, the WARC-Concurrent-To association is bidirectional even if the header appears in only one of the records). The WARC-Concurrent-To field does not apply to "warcinfo", "conversion" and "continuation" records.
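The bidirectional reading of the association can be sketched as follows. This Python example assumes records are already parsed into dictionaries whose "WARC-Concurrent-To" entry is a list of record IDs; that in-memory layout, the function name `concurrent_links` and the sample IDs are all hypothetical.

```python
from collections import defaultdict


def concurrent_links(records):
    """Map each WARC-Record-ID to the set of records concurrent with it.

    A link declared in only one record is registered in both directions.
    """
    links = defaultdict(set)
    for record in records:
        record_id = record["WARC-Record-ID"]
        for other_id in record.get("WARC-Concurrent-To", []):
            links[record_id].add(other_id)
            links[other_id].add(record_id)
    return dict(links)


# Only the "request" record carries the header; the "response" record does not.
records = [
    {"WARC-Type": "response", "WARC-Record-ID": "<urn:uuid:response-1>"},
    {
        "WARC-Type": "request",
        "WARC-Record-ID": "<urn:uuid:request-1>",
        "WARC-Concurrent-To": ["<urn:uuid:response-1>"],
    },
]

links = concurrent_links(records)
print(links["<urn:uuid:response-1>"])
```

The "response" record ends up linked to the "request" record even though it declares nothing itself.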
As a special case, the WARC-Concurrent-To field can be repeated in the same WARC record. See Appendix A for examples of the use of the WARC-Concurrent-To field.
5.8 WARC-Block-Digest
WARC-Block-Digest is an optional parameter giving the algorithm name and calculated value of a digest applied to the full content block of the record.
WARC-Block-Digest = "WARC-Block-Digest" ":" labelled-digest
labelled-digest = algorithm ":" digest-value
algorithm = token
digest-value = token
The following example shows a SHA-1 digest as a Base32 ([RFC3548]) value:
WARC-Block-Digest: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ
No particular algorithm is recommended.
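A reader can use the labelled-digest grammar to verify a stored block. The Python sketch below splits the value at the first colon and handles only the "sha1" label with Base32 encoding, as in the example; other algorithms are rejected because the standard recommends none. The names `parse_labelled_digest` and `block_digest_matches` are hypothetical.

```python
import base64
import hashlib


def parse_labelled_digest(value: str):
    """Split 'algorithm:digest-value' into its two tokens."""
    algorithm, separator, digest_value = value.strip().partition(":")
    if not separator or not algorithm or not digest_value:
        raise ValueError(f"not a labelled digest: {value!r}")
    return algorithm, digest_value


def block_digest_matches(block: bytes, field_value: str) -> bool:
    """Check a WARC-Block-Digest value against the record's content block."""
    algorithm, expected = parse_labelled_digest(field_value)
    if algorithm != "sha1":
        # Assumption of this sketch: only Base32-encoded SHA-1 is supported.
        raise ValueError(f"unsupported algorithm: {algorithm}")
    actual = base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
    return actual == expected


block = b"example content block"
field_value = "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
print(block_digest_matches(block, field_value))
```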
Any record may have a WARC-Block-Digest field.
The WARC-IP-Address field may be used in "response", "resource", "request", "metadata" and "revisit" records, but not in "warcinfo", "conversion" or "continuation" records.
5.11 WARC-Refers-To
If the current record holds additional content about another WARC record, the WARC-Refers-To field of the current record contains the WARC-Record-ID of that record.
WARC-Refers-To = "WARC-Refers-To" ":" uri
The WARC-Refers-To field can be used to associate a "metadata" record with the record it describes. It can also be used in "revisit" or "conversion" records to identify the earlier record on which the current record's content is based. The WARC-Refers-To field shall not be used in "warcinfo", "response", "request" and "continuation" records. See Appendix A for examples of the use of the WARC-Refers-To field.
5.12 WARC-Target-URI
WARC-Target-URI is the original URI whose capture produced the content of this record. In web harvesting, it is the URI that was the target of the crawler's retrieval request. For a "revisit" record, it is likewise the URI that was the target of the retrieval request. Indirectly, for a "metadata" or "conversion" record, WARC-Target-URI is a copy of the WARC-Target-URI that appears in the original record to which the new record relates. The URI value should be written as defined in [RFC3986].
WARC-Target-URI = "WARC-Target-URI" ":" uri
All "response", "resource", "request", "revisit", "conversion" and "continuation" records should have a WARC-Target-URI field. "metadata" records may have a WARC-Target-URI field. "warcinfo" records should not have a WARC-Target-URI field.
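The record-type rules stated in 5.7, 5.10, 5.11 and 5.12 can be collected into a small lookup table. The Python sketch below encodes only the rules as worded in this text and reports deviations as messages rather than errors, since several of them are "should" rules. The names `NOT_EXPECTED_IN`, `TARGET_URI_EXPECTED_IN` and `check_fields` are hypothetical.

```python
# Record types in which a field is not to be used, per the clauses above.
NOT_EXPECTED_IN = {
    "WARC-Concurrent-To": {"warcinfo", "conversion", "continuation"},
    "WARC-IP-Address": {"warcinfo", "conversion", "continuation"},
    "WARC-Refers-To": {"warcinfo", "response", "request", "continuation"},
    "WARC-Target-URI": {"warcinfo"},
}

# Record types that should carry a WARC-Target-URI field.
TARGET_URI_EXPECTED_IN = {
    "response", "resource", "request", "revisit", "conversion", "continuation",
}


def check_fields(record_type, field_names):
    """Return a list of messages describing deviations from the rules above."""
    present = set(field_names)
    problems = []
    for name in sorted(present):
        if record_type in NOT_EXPECTED_IN.get(name, ()):
            problems.append(f"{name} is not expected in '{record_type}' records")
    if record_type in TARGET_URI_EXPECTED_IN and "WARC-Target-URI" not in present:
        problems.append(f"'{record_type}' records should have WARC-Target-URI")
    return problems


print(check_fields("response", ["WARC-Target-URI", "WARC-IP-Address"]))
print(check_fields("request", ["WARC-Refers-To"]))
```

A table-driven check keeps the rules in one place, so a rule can be adjusted without touching the checking logic.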
5.13 WARC-Truncated
For practical reasons, the creator of a WARC file may limit the time or storage space spent on archiving an individual resource. As a result, only a truncated portion of the original resource may be stored in a WARC record. The WARC-Truncated field indicates that the content block of the record has been truncated, and gives the reason for the truncation.
WARC-Truncated = "WARC-Truncated" ":" reason-token
reason-token = "length" ; maximum length limit exceeded
             | "time" ; maximum time limit exceeded
             | "disconnect" ; network disconnected
             | "unspecified" ; unknown reason
             | future-reason
future-reason = token
For example, when fetching a resource several gigabytes in size, the fetch may be interrupted when the transfer time limit is reached, and the part of the resource already retrieved is stored in a WARC record carrying this field.
The WARC-Truncated field may be used in any WARC record. The Content-Length field should still report the actual, truncated size of the record's content block.
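A length-based truncation can be sketched in a few lines of Python. The example assumes the writer applies a byte limit to an in-memory block; the function name `truncate_block` and the limit value are hypothetical. Note that Content-Length reflects the stored, truncated block.

```python
def truncate_block(block: bytes, max_length: int):
    """Cut a content block to max_length bytes and build the matching header fields."""
    stored = block[:max_length]
    headers = {}
    if len(block) > max_length:
        # Reason token "length": the maximum length limit was exceeded.
        headers["WARC-Truncated"] = "length"
    # Content-Length reports the size actually stored, not the original size.
    headers["Content-Length"] = str(len(stored))
    return stored, headers


stored, headers = truncate_block(b"abcdef", 4)
print(stored, headers)
```

A block that fits within the limit is stored whole and receives no WARC-Truncated field.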
5.14 WARC-Warcinfo-ID
WARC-Warcinfo-ID gives the WARC-Record-ID of the "warcinfo" record associated with this record. Normally, the WARC-Warcinfo-ID parameter is used when the context of the applicable "warcinfo" record is unavailable, for example after single records have been distributed into separate WARC files. WARC writing software (such as web crawlers) may choose to always record this parameter.
WARC-Warcinfo-ID = "WARC-Warcinfo-ID" ":" uri
The WARC-Warcinfo-ID field value overrides any association with a previously occurring "warcinfo" record, thus providing a way to preserve the true association when records from different WARC files are merged. The WARC-Warcinfo-ID field may be used in any record type except "warcinfo".
5.15 WARC-Filename
WARC-Filename is the name of the file containing the current "warcinfo" record.