Blame: docs/data_format.md - hankcs/HanLP

---
jupytext:
  formats: ipynb,md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: '0.8'
    jupytext_version: 1.4.2
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Data Format

## Input Format

### RESTful Input

#### Definition

To make a RESTful call, send an HTTP POST request with a `json` body to the server. The body must contain at least a `text`
field or a `tokens` field. The input to the RESTful API is flexible and can take one of the following 3 formats:

1. A document as a raw `str`, filled into `text`. The server will split it into sentences.
1. A `list` of sentences, where each sentence is a raw `str`, filled into `text`.
1. A `list` of tokenized sentences, where each sentence is a `list` of `str` tokens, filled into `tokens`.

```{eval-rst}
Additionally, fine-grained control is available through the arguments defined in
:meth:`hanlp_restful.HanLPClient.parse`.
```
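The three input formats can be sketched as plain JSON bodies. This is a minimal illustration only: `build_payload` is a hypothetical helper written for this example, not part of `hanlp_restful`, and the English sentences are placeholder data.

```python
import json


def build_payload(text=None, tokens=None, language=None):
    """Build a JSON-serializable request body (hypothetical helper)."""
    if text is None and tokens is None:
        raise ValueError("Provide at least one of `text` or `tokens`.")
    payload = {}
    if text is not None:
        payload["text"] = text
    if tokens is not None:
        payload["tokens"] = tokens
    if language is not None:
        payload["language"] = language
    return payload


# 1. A raw document in `text`; the server splits it into sentences.
doc = build_payload(text="First sentence. Second sentence.")
# 2. A list of raw sentences in `text`.
sents = build_payload(text=["First sentence.", "Second sentence."])
# 3. A list of tokenized sentences in `tokens`.
toks = build_payload(tokens=[["First", "sentence", "."], ["Second", "sentence", "."]])

for body in (doc, sents, toks):
    print(json.dumps(body, ensure_ascii=False))
```

Rejecting a body with neither field mirrors the requirement that at least one of `text` or `tokens` be present.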

#### Examples

```shell script
curl -X 'POST' \
  'https://hanlp.hankcs.com/api/parse' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "language": "zh",
  "text": "HanLP为生产环境带来次世代最先进的多语种NLP技术。晓美焰来到北京参观自然语义科技公司。"
}'
```
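The same request can be assembled with the Python standard library. The sketch below only builds the request object and leaves the network call commented out. Authentication, which the server may require, is not shown.

```python
import json
import urllib.request

url = "https://hanlp.hankcs.com/api/parse"
body = {
    "language": "zh",
    "text": "HanLP为生产环境带来次世代最先进的多语种NLP技术。晓美焰来到北京参观自然语义科技公司。",
}

# Mirror the curl call: POST with a JSON body and JSON headers.
request = urllib.request.Request(
    url,
    data=json.dumps(body).encode("utf-8"),
    headers={"accept": "application/json", "Content-Type": "application/json"},
    method="POST",
)

# Uncomment to actually send the request (requires network access):
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read().decode("utf-8"))
```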

### Model Input

````{margin} How about training inputs?
```{seealso}
We mostly follow the conventional file format of each NLP task instead of reinventing it. Thus, we use `.tsv` for tagging,
`.conllu` for parsing, and so on. For more details, refer to [datasets](https://hanlp.hankcs.com/docs/api/hanlp/datasets/index.html).
```
````

The input format is specified per model and per task. Generally speaking, if a model has no built-in tokenizer, its input is
a sentence in `list[str]` form (a list of tokens), or multiple such sentences nested in a `list`.

If a model has a built-in tokenizer, each sentence is in `str` form.
Additionally, you can pass `skip_tasks='tok*'` to make the model use your tokenized inputs instead of tokenizing
them, in which case each of your sentences needs to be in `list[str]` form, as if there were no tokenizer.

```{eval-rst}
Every model takes sentence-level input, which means you have to split a document into sentences beforehand.
You may want to try :class:`~hanlp.components.eos.ngram.NgramSentenceBoundaryDetector` for sentence splitting.
```
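The shape rules above can be summarized in a small sketch. `describe_input` is a hypothetical helper written for this illustration and is not part of HanLP. It also assumes that a model with a built-in tokenizer accepts several raw sentences as a `list` of `str`.

```python
def describe_input(data, has_tokenizer):
    """Classify an input by its shape (hypothetical helper, not a HanLP API)."""
    is_list = isinstance(data, list)
    if has_tokenizer:
        # Raw sentences: a single `str`, or a list of `str`.
        if isinstance(data, str):
            return "one raw sentence"
        if is_list and all(isinstance(s, str) for s in data):
            return "batch of raw sentences"
    else:
        # Tokenized sentences: `list[str]`, or a list of `list[str]`.
        if is_list and all(isinstance(t, str) for t in data):
            return "one tokenized sentence"
        if is_list and all(
            isinstance(s, list) and all(isinstance(t, str) for t in s) for s in data
        ):
            return "batch of tokenized sentences"
    raise TypeError("Input shape does not match the expected format.")


print(describe_input("I like tea .", has_tokenizer=True))
print(describe_input(["I", "like", "tea", "."], has_tokenizer=False))
print(describe_input([["I", "like", "tea"], ["Me", "too"]], has_tokenizer=False))
```

Note that a flat `list` of `str` is ambiguous on its own: it is a batch of raw sentences for a model with a tokenizer, but a single tokenized sentence otherwise.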

## Output Format


			```{eval-rst}
			The outputs of both :class:`~hanlp_restful.HanLPClient` and
			:class:`~hanlp.components.mtl.multi_task_learning.MultiTaskLearning` are unified as the same
			:class:`~hanlp_common.document.Document` format.
			```

For example, the following RESTful code outputs such an instance.

```{code-cell} ipython3
:tags: [output_scroll]
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None)  # Fill in your auth
print(HanLP('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。晓美焰来到北京立方庭参观自然语义科技公司。'))
```

The output above is represented as a `json` dictionary where each key is a task name and its value is
the output of the corresponding task.
For each output, if it is a nested `list`, it contains multiple sentences; otherwise it is a single sentence.
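The nesting rule can be sketched as follows. The dictionaries are hand-written illustrations, not actual server output, and `sentences_of` is a hypothetical helper. The check works only for tasks whose elements are plain strings, such as `tok` and `pos`. Tasks whose elements are tuples or lists have one more level of nesting and would need a different check.

```python
def sentences_of(task_output):
    """Return a task output as a list of sentences (hypothetical helper).

    A nested list is treated as multiple sentences; a flat list of
    strings is wrapped so that callers always iterate over sentences.
    """
    if task_output and isinstance(task_output[0], list):
        return task_output
    return [task_output]


# Hand-written examples of the two shapes.
single = {"tok": ["I", "like", "tea"], "pos": ["PRON", "VERB", "NOUN"]}
batch = {"tok": [["I", "like", "tea"], ["Me", "too"]]}

for name, output in (("single", single), ("batch", batch)):
    for sentence in sentences_of(output["tok"]):
        print(name, sentence)
```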

We adopt the following naming convention for NLP tasks, where each task key consists of 3 letters.

			````{margin} How about annotations?
			```{seealso}
Each NLP task can exploit multiple datasets with their annotations, see our [annotations](annotations/index) for details.
			```
			````

### Naming Convention

| key  | Task                                                         | Chinese      |
| ---- | ------------------------------------------------------------ | ------------ |
| tok  | Tokenization. Each element is a token.                       | 分词         |
| pos  | Part-of-Speech Tagging. Each element is a tag.               | 词性标注     |
| lem  | Lemmatization. Each element is a lemma.                      | 词干提取     |
| fea  | Features of Universal Dependencies. Each element is a feature. | 词法语法特征 |
| ner  | Named Entity Recognition. Each element is a tuple of `(entity, type, begin, end)`, where `end`s are exclusive offsets. | 命名实体识别 |
| dep  | Dependency Parsing. Each element is a tuple of `(head, relation)` where `head` starts with index `0` (which is `ROOT`). | 依存句法分析 |
| con  | Constituency Parsing. Each list is a bracketed constituent.  | 短语成分分析 |
| srl  | Semantic Role Labeling. Similar to `ner`, each element is a tuple of `(arg/pred, label, begin, end)`, where the predicate is labeled as `PRED`. | 语义角色标注 |
| sdp  | Semantic Dependency Parsing. Similar to `dep`, however each token can have any number (including zero) of heads and corresponding relations. | 语义依存分析 |
| amr  | Abstract Meaning Representation. Each AMR graph is represented as a list of logical triples. See [AMR guidelines](https://github.com/amrisi/amr-guidelines/blob/master/amr.md#example). | 抽象意义表示 |
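As an illustration of the `ner` and `dep` conventions in the table, the sketch below reads hand-written tuples for a toy English sentence. The data and labels are invented for this example and are not actual model output. It also assumes that `begin` and `end` count tokens rather than characters.

```python
tokens = ["Alice", "visited", "New", "York"]

# (entity, type, begin, end): `end` is exclusive, so tokens[begin:end] is the span.
ner = [("Alice", "PERSON", 0, 1), ("New York", "LOCATION", 2, 4)]

# (head, relation): one tuple per token; head 0 is ROOT, head k is the k-th token.
dep = [(2, "nsubj"), (0, "root"), (4, "compound"), (2, "obj")]

for entity, label, begin, end in ner:
    print(label, entity, tokens[begin:end])


def head_word(head):
    """Map a head index to its word; index 0 denotes the artificial ROOT."""
    return "ROOT" if head == 0 else tokens[head - 1]


arcs = [(tokens[i], relation, head_word(head)) for i, (head, relation) in enumerate(dep)]
for dependent, relation, head in arcs:
    print(f"{dependent} --{relation}--> {head}")
```

Because index `0` is reserved for `ROOT`, head indices are effectively 1-based with respect to the token list, hence the `head - 1` when looking up the word.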

When multiple models perform the same task, their keys are suffixed with a secondary identifier.
For example, `tok/fine` and `tok/coarse` denote a fine-grained tokenization model and a coarse-grained one, respectively.
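A key with a secondary identifier can be split back into its task name and identifier. `split_key` below is a hypothetical helper for this illustration, not a HanLP function.

```python
from collections import defaultdict


def split_key(key):
    """Split 'task/identifier' into (task, identifier); identifier may be None."""
    task, _, identifier = key.partition("/")
    return task, identifier or None


# Group output keys by their 3-letter task name.
by_task = defaultdict(list)
for key in ["tok/fine", "tok/coarse", "pos"]:
    task, identifier = split_key(key)
    by_task[task].append(identifier)

print(dict(by_task))
```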