IKCEST22 - Chinese/French/Russian/Thai Translation Baseline
Preface
Seeing that many people signed up enthusiastically but relatively few actually submitted, I organized the code below as a competition baseline so everyone can get on the leaderboard quickly. This project uses transformer_base, trained for 50 epochs on all 6 translation directions between Chinese and French, Russian, and Thai. I wish you all good results in the competition! And if you're a true bro, come and compete with me!
Project address: PaddleSeq
Updates:
If you run into a bug or find anything hard to use, criticism and corrections are welcome!
9/6:
1. Fixed a bug where loading a model with --arch failed; thanks to "Three years old" for pointing it out!
2. Following the suggestions of "Three years old" and "Li Yu Listens to the Moon", resume training has been simplified: instead of specifying --resume together with the last epoch and last step, you now load the yaml saved in the checkpoint directory directly, i.e. -c ckpt_dir/model.yaml; see 3.3 Resuming training for details.
3. Following the suggestion of "Li Yu Listens to the Moon", a simplified, silky one-click workflow has been added (the ⭐ sections below).
1, Event introduction
The 2022 IKCEST Fourth "Belt and Road" International Big Data Competition
Under the guidance of the Chinese Academy of Engineering, the University Computer Course Teaching Steering Committee of the Ministry of Education, and the Silk Road University Alliance, this big data competition is jointly sponsored by the International Knowledge Centre for Engineering Sciences and Technology (IKCEST), the China Knowledge Centre for Engineering Sciences and Technology (CKCEST), Baidu, and Xi'an Jiaotong University. It focuses on countries along the "Belt and Road", aims to discover top big data and AI talent worldwide through competition, promotes joint government-industry-academia efforts in big data research, application, and development, further consolidates the theoretical and practical foundation of the event, and accelerates the cultivation of top AI innovators.
2, Data processing
2.1 Environment setup
!unzip PaddleSeq.zip  # !git clone https://github.com/MiuGod0126/PaddleSeq.git
%cd PaddleSeq
!pip install -r requirements.txt
!unzip ../nmt_data_tools.zip -d ./  # !git clone https://github.com/MiuGod0126/nmt_data_tools.git
!pip install -r nmt_data_tools/requirements.txt
2.2 Data processing
First, the Chinese and Thai text is word-segmented (with jieba / pythainlp), then subword-nmt BPE is applied to all languages, and 1,000 sentences are randomly split off the training set as the validation set. The script takes about 7 minutes to run.
!bash examples/ikcest22/scripts/prepare-ikcest22.sh
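For reference, below is a minimal Python sketch of what the per-sentence preprocessing looks like. This is only an illustration under my assumptions; the exact options, BPE vocabulary size, and file layout used by prepare-ikcest22.sh may differ.

# Illustrative only: rough per-sentence preprocessing (assumed, not copied from prepare-ikcest22.sh)
import jieba
from pythainlp.tokenize import word_tokenize

def tokenize_zh(line):
    # Chinese word segmentation with jieba, space-joined
    return " ".join(jieba.cut(line.strip()))

def tokenize_th(line):
    # Thai word segmentation with pythainlp (default newmm engine), space-joined
    return " ".join(word_tokenize(line.strip()))

print(tokenize_zh("今天天气很好"))
print(tokenize_th("สวัสดีครับ"))

# After tokenization, BPE codes are learned and applied with subword-nmt, e.g.:
#   subword-nmt learn-bpe -s 10000 < train.tok.zh > code.zh     (the vocabulary size 10000 is an assumption)
#   subword-nmt apply-bpe -c code.zh < train.tok.zh > train.bpe.zh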
# Under datasets/: raw holds the original unprocessed parallel text; tmp holds the word-segmented and BPE-processed text; the final training data is written to the bpe folder
!ls datasets/
!head -n 1 datasets/raw/zh_th/train.zh
!head -n 1 datasets/tmp/zh_th/train.tok.zh
!head -n 1 datasets/tmp/zh_th/train.bpe.zh
bpe  raw  tmp
Initial alien head portrait frame
Avatar frame of the first visit to the alien world
first@@ Temporary dissimilarity@@ circles@@ Avatar frame
# Data format used by this project: each split of a language pair is stored as two files, prefix.lang; valid is the 1,000 randomly split sentences, and code is the BPE code file
!ls datasets/bpe/zh_th/
code.th  test.th_zh.th  train.th  valid.th  vocab.th
code.zh  test.zh_th.zh  train.zh  valid.zh  vocab.zh
2.3 Data loading
# Load the Chinese-Thai data
from paddleseq.reader import prep_dataset, prep_loader
from yacs.config import CfgNode

cfg_path = "examples/ikcest22/configs/zh_th.yaml"
conf = CfgNode.load_cfg(open(cfg_path, encoding="utf-8"))
dataset_train = prep_dataset(conf, mode="train")
dataset_valid = prep_dataset(conf, mode="dev")
Configuration file:
SAVE: output
data:
  has_target: True
  lang_embed: False
  lazy_load: False
  pad_factor: 8
  pad_vocab: False
  special_token: ['<s>', '<pad>', '</s>', '<unk>']
  src_bpe_path: datasets/bpe/zh_th/code.zh
  src_lang: zh
  test_pref: datasets/bpe/zh_th/test.zh_th
  tgt_bpe_path: datasets/bpe/zh_th/code.th
  tgt_lang: th
  train_pref: datasets/bpe/zh_th/train
  truecase_path: None
  use_binary: False
  use_moses_bpe: False
  valid_pref: datasets/bpe/zh_th/valid
  vocab_pref: datasets/bpe/zh_th/vocab
......
# Print 5 validation examples
for data in dataset_valid[:5]:
    print(data)
{'id': 0, 'src': 'During the transformation BOSS Damage increased by 7%', 'tgt': 'ใน ช่วง กลายร่าง ดา เม จ ที่ ทำ ใส่ BOSS ทั้งหมด เพิ่มขึ้น 7 %'}
{'id': 1, 'src': "I'm getting bored, rats! Try it. It's from Kan@@ Tru's highest scientific power!", 'tgt': 'ข้า เริ่ม รำ คาน แล้ว นะ พวก หนู ! มา ลอง พลัง แห่ง เทคโนโลยี สูงสุด ของ คัท@@ ลู@@ !'}
{'id': 2, 'src': 'And the back is with shoulder@@ The design of the line, which is lazy and a little neat, has made a super long model, which looks slender.', 'tgt': 'การ ออกแบบ ด้านหลัง ที่ มี เส้น ไหล่ ภายใน ความ@@ ขี้เกียจ ติด ความ เรียบร้อย บ้าง ทำเป็น แบบ ยาว พิเศษ ทำให้ เส้น โครงร่าง ของ คน เรียว ยาว'}
{'id': 3, 'src': "Cause magic damage to all enemies' targets. If the target is less than 5 people, the damage will be increased by 10% for each reduction; at the same time, the critical hit probability decreases by 22%, lasting 2 rounds; if the target is a warrior class, dispel all its gain states.", 'tgt': 'ทำ M . DMG ใส่ ศัตรู ทั้งหมด เมื่อ เป้าหมาย น้อยกว่า 5 คน ทุกครั้งที่ ลด 1 คน DMG เพิ่ม 10 % ขณะเดียวกัน อัตรา CRIT ลด 22 % ต่อเนื่อง 2 รอบ ; หาก เป้าหมาย เป็น นักรบ ขับไล่ บัฟ ทั้งหมด ของ เป้าหมาย'}
{'id': 4, 'src': 'Cook@@ Cooking and preparation@@ food', 'tgt': 'ปรุง@@ อาหาร และ เตรียม อาหาร'}
# tokens to ids, group into batches
train_loader = prep_loader(conf, dataset_train, mode="train", multi_process=False)
valid_loader = prep_loader(conf, dataset_valid, mode="dev", multi_process=False)
for batch_idx, batch_data in enumerate(valid_loader):
    samples_id, src_tokens, prev_tokens, tgt_tokens = batch_data
    print(f"samples_id:{samples_id.shape} , src_tokens:{src_tokens.shape}, prev_tokens:{prev_tokens.shape}, tgt_tokens:{tgt_tokens.shape}")
    if batch_idx > 4:
        break
samples_id:[312] , src_tokens:[312, 6], prev_tokens:[312, 13], tgt_tokens:[312, 13, 1]
samples_id:[240] , src_tokens:[240, 8], prev_tokens:[240, 16], tgt_tokens:[240, 16, 1]
samples_id:[184] , src_tokens:[184, 13], prev_tokens:[184, 22], tgt_tokens:[184, 22, 1]
samples_id:[120] , src_tokens:[120, 20], prev_tokens:[120, 31], tgt_tokens:[120, 31, 1]
samples_id:[88] , src_tokens:[88, 36], prev_tokens:[88, 42], tgt_tokens:[88, 42, 1]
samples_id:[48] , src_tokens:[48, 63], prev_tokens:[48, 77], tgt_tokens:[48, 77, 1]
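As a reading aid: in a typical seq2seq training setup (this is my assumption about PaddleSeq's convention, not something confirmed by the repo), prev_tokens is the decoder input, i.e. the target sequence shifted right by one position for teacher forcing, while tgt_tokens holds the gold labels; that would explain why the two share the same length while src_tokens differs. A made-up illustration:

# Hypothetical illustration of teacher forcing; the ids below are made up and do not
# come from the real vocab (the special tokens are ['<s>', '<pad>', '</s>', '<unk>']).
bos, pad, eos = 0, 1, 2
tgt = [11, 12, 13, eos]            # gold target sentence
prev_tokens = [bos] + tgt[:-1]     # decoder input, shifted right: [0, 11, 12, 13]
tgt_tokens = tgt                   # labels the decoder is trained to predict at each step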
3, Model training
3.1 Model architecture
Transformer is the network architecture proposed in the paper Attention Is All You Need for sequence-to-sequence (Seq2Seq) learning tasks such as machine translation; it models sequence-to-sequence problems entirely with the attention mechanism.
This project uses the transformer_base architecture by default. To switch to transformer_big, modify model_name in the configuration yaml:
model:
  model_name: transformer_big
In addition, you can also specify the network structure in the command line, such as:
python paddleseq_cli/train.py -c examples/ikcest22/configs/zh_th.yaml --arch transformer_big
If you want to change the network structure or its parameters, add a new architecture in the source file transformer.py, for example:
def transformer_big(is_test=False, pretrained_path=None, **kwargs):
    for cfg in cfgs:
        assert cfg in kwargs, f'missing argument:{cfg}'
    model_args = dict(encoder_layers=6,
                      decoder_layers=6,
                      d_model=1024,
                      nheads=16,
                      dim_feedforward=4096,
                      **kwargs)
    model_args = base_architecture(model_args)
    model = _create_transformer('transformer_big', is_test, pretrained_path, model_args)
    return model
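For example, following exactly the same pattern, you could sketch your own variant. The name transformer_small and its hyper-parameters below are hypothetical and not part of the repo; cfgs, base_architecture, and _create_transformer are the same helpers used above.

# Hypothetical example mirroring transformer_big above (not part of the repo)
def transformer_small(is_test=False, pretrained_path=None, **kwargs):
    for cfg in cfgs:
        assert cfg in kwargs, f'missing argument:{cfg}'
    model_args = dict(encoder_layers=3,
                      decoder_layers=3,
                      d_model=256,
                      nheads=4,
                      dim_feedforward=1024,
                      **kwargs)
    model_args = base_architecture(model_args)
    return _create_transformer('transformer_small', is_test, pretrained_path, model_args)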
3.2 Model loading
from paddleseq.models import build_model

model = build_model(conf, is_test=False)
print(model)
Running time: 4 seconds 585 milliseconds
TRAIN model transformer_base created!
Transformer(
  (encoder): Encoder(
    (layers): LayerList(
      (0): EncoderLayer(
        (self_attn): MultiHeadAttentionWithInit(
          (q_proj): Linear(in_features=512, out_features=512, dtype=float32)
          (k_proj): Linear(in_features=512, out_features=512, dtype=float32)
          (v_proj): Linear(in_features=512, out_features=512, dtype=float32)
          (out_proj): Linear(in_features=512, out_features=512, dtype=float32)
        )
        (norm1): LayerNorm(normalized_shape=[512], epsilon=1e-05)
        (norm2): LayerNorm(normalized_shape=[512], epsilon=1e-05)
        (dropout1): Dropout(p=0.3, axis=None, mode=upscale_in_train)
        (dropout2): Dropout(p=0.3, axis=None, mode=upscale_in_train)
        (mlp): Mlp(
          (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
          (linear1): Linear(in_features=512, out_features=2048, dtype=float32)
          (linear2): Linear(in_features=2048, out_features=512, dtype=float32)
# Alternatively, you can load the model from the yaml in a checkpoint directory (change ckpt_path to your own checkpoint directory)
# from paddleseq.models import build_model
# ckpt_path = "output/ckpt_zhth/epoch_final/"
# model = build_model(ckpt_path, is_test=False)
# print(model)
TRAIN model transformer_base created!
Pretrained weight load from:output/ckpt_zhth/epoch_final/model.pdparams!
3.3 Chinese-Thai training
This section uses Chinese-Thai as an example to introduce PaddleSeq's training commands.
Note: if you just want to submit quickly, skip ahead and run 3.4 directly.
# View the configuration files
!ls examples/ikcest22/configs/
fr_zh.yaml ru_zh.yaml th_zh.yaml zh_fr.yaml zh_ru.yaml zh_th.yaml
# Train Chinese-Thai for 3 epochs
# !export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'  # optionally suppress some warning messages
!python paddleseq_cli/train.py -c examples/ikcest22/configs/zh_th.yaml --update-freq 4 --max-epoch 3
2022-09-04 18:39:24,058 | transformer_base_rank0: ----- Total of train set:99000 ,train batch: 407 [single gpu]
2022-09-04 18:39:24,104 | transformer_base_rank0: ----- Total of valid set:1000 ,valid batch: 7 [single gpu]
2022-09-04 18:39:24,105 | transformer_base_rank0: Load data cost 1.7559177875518799 seconds.
2022-09-04 18:40:44,284 | transformer_base_rank0: Now training epoch 2. LR=0.00005
2022-09-04 18:41:01,809 | transformer_base_rank0: Train| epoch:[2/3], step:[100/395], speed:5.71 step/s, loss:11.520, nll_loss:11.148, ppl:2269.24, bsz:1001.9, gnorm:3.339, num_updates:123, lr:0.000063212
2022-09-04 18:41:15,831 | transformer_base_rank0: Train| epoch:[2/3], step:[200/391], speed:7.13 step/s, loss:11.307, nll_loss:10.899, ppl:1909.37, bsz:1011.2, gnorm:3.999, num_updates:147, lr:0.000075710
2022-09-04 18:41:30,304 | transformer_base_rank0: Train| epoch:[2/3], step:[300/404], speed:6.91 step/s, loss:11.169, nll_loss:10.731, ppl:1700.19, bsz:979.0, gnorm:4.359, num_updates:176, lr:0.000088207
2022-09-04 18:41:44,636 | transformer_base_rank0: Train| epoch:[2/3], step:[400/405], speed:6.98 step/s, loss:11.061, nll_loss:10.599, ppl:1551.27, bsz:977.0, gnorm:4.157, num_updates:201, lr:0.000100705
100%|█████████████████████████████████████████████| 7/7 [00:07<00:00, 1.07it/s]
2022-09-04 18:41:54,192 | transformer_base_rank0: Eval | Avg loss: 10.350 | nll_loss:9.716 | ppl: 900.017 | Eval | BLEU Score: 1.239
current checkpoints: ['model_best_0.0', 'epoch_final', 'model_best_1.239']
2022-09-04 18:42:01,859 | transformer_base_rank0: Epoch:[2] | Best Valid Bleu: [1.239] saved to output/ckpt_zhth/model_best_1.239!
2022-09-04 18:42:01,859 | transformer_base_rank0: Now training epoch 3. LR=0.00010
2022-09-04 18:42:18,927 | transformer_base_rank0: Train| epoch:[3/3], step:[100/402], speed:5.86 step/s, loss:10.527, nll_loss:9.950, ppl:989.44, bsz:983.4, gnorm:3.806, num_updates:226, lr:0.000114077
2022-09-04 18:42:33,549 | transformer_base_rank0: Train| epoch:[3/3], step:[200/431], speed:6.84 step/s, loss:10.486, nll_loss:9.902, ppl:957.00, bsz:917.0, gnorm:3.867, num_updates:265, lr:0.000126575
2022-09-04 18:42:48,090 | transformer_base_rank0: Train| epoch:[3/3], step:[300/416], speed:6.88 step/s, loss:10.350, nll_loss:9.747, ppl:859.50, bsz:951.3, gnorm:4.090, num_updates:283, lr:0.000139072
2022-09-04 18:43:03,210 | transformer_base_rank0: Train| epoch:[3/3], step:[400/403], speed:6.61 step/s, loss:10.284, nll_loss:9.671, ppl:815.06, bsz:980.2, gnorm:4.266, num_updates:301, lr:0.000151570
100%|█████████████████████████████████████████████| 7/7 [00:08<00:00, 1.09s/it]
2022-09-04 18:43:16,024 | transformer_base_rank0: Epoch:[3] | Best Valid Bleu: [2.758] saved to output/ckpt_zhth/model_best_2.758!
current checkpoints: ['model_best_0.0', 'epoch_final', 'model_best_1.239', 'model_best_2.758']
Introduction to training parameters:
python paddleseq_cli/train.py -c xx.yaml \
    --amp --ngpus 1 --update-freq 4 \
    --max-epoch 50 --save-epoch 10 --save-dir output \
    --pretrained ckpt --log-steps 100 --max-tokens 4096 \
    --seed 1 --eval-beam

-c: path to the configuration file;
--amp: mixed-precision training, which speeds up training (with amp the gradient norm gnorm reported in the log becomes very large; this is normal and is caused by the loss-scaling factor);
--ngpus: number of GPUs to use;
--update-freq: gradient accumulation, which simulates a larger batch_size and can yield better scores (when set to 4, the log shows num_updates = step / 4 and bsz = bsz * 4);
--max-epoch: number of training epochs;
--save-epoch: interval, in epochs, at which weights are saved;
--save-dir: where model weights, logs, visualization logs, and prediction outputs are saved;
--pretrained: pretrained weights to load; this is a directory, usually under output/ckpt, containing model.pdparams, model.pdopt, and model.yaml;
--log-steps: logging frequency;
--max-tokens: maximum number of tokens per batch (the maximum of source or target tokens);
--seed: random seed;
--eval-beam: defaults to False; when specified, beam search is used during evaluation to generate predictions and compute the BLEU score, which is more accurate than the default teacher-forcing argmax output.
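As a quick sanity check of how --max-tokens and --update-freq interact, the approximate number of tokens contributing to each parameter update is:

# Approximate tokens per parameter update for the flags above
max_tokens = 4096    # --max-tokens: upper bound on source/target tokens per batch
update_freq = 4      # --update-freq: gradient accumulation steps
ngpus = 1            # --ngpus
print(max_tokens * update_freq * ngpus)   # 16384 (an upper bound; actual batches are usually smaller)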
About resuming training:
Sometimes you have no choice but to stop training and restart it later. In that case, resume training: the previously saved model weights and optimizer state are restored by loading the model.yaml from the checkpoint directory directly:
ckpt_dir=output/ckpt_zhth/epoch_final
python paddleseq_cli/train.py -c $ckpt_dir/model.yaml
For other parameters, see config.py; most of them mirror the entries in the yaml, and command-line arguments override the values in the yaml.
3.4 Training (all 6 directions) ⭐
Run training for all six directions with one command and get on the leaderboard fast!
# View the training script and modify the parameters as needed
!cat examples/ikcest22/scripts/train_all.sh
epochs=50
freq=4 # update frequency
directions=("zh_th" "th_zh" "zh_fr" "fr_zh" "zh_ru" "ru_zh")
export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'
for direct in ${directions[@]}
do
    echo "------------------------------------------------------------training ${direct}....------------------------------------------------------------"
    python paddleseq_cli/train.py -c examples/ikcest22/configs/${direct}.yaml --update-freq $freq --max-epoch $epochs
done
echo "all done"
# ≈10h
!bash examples/ikcest22/scripts/train_all.sh
----------------training zh_th....----------------
INFO 2022-09-04 18:24:46,576 cloud_utils.py:122] get cluster from args:job_server:None pods:['rank:0 id:None addr:127.0.0.1 port:None visible_gpu:[] trainers:["gpu:[\'0\'] endpoint:127.0.0.1:58733 rank:0"]'] job_stage_flag:None hdfs:None
........................
2022-09-04 18:24:52,306 | transformer_base_rank0: Now training epoch 1. LR=0.00000
2022-09-04 18:25:11,480 | transformer_base_rank0: Train| epoch:[1/50], step:[100/419], speed:5.22 step/s, loss:14.192, nll_loss:14.134, ppl:17973.68, bsz:944.0, gnorm:9.323, num_updates:25, lr:0.000012348
----------------training th_zh....----------------
...............................
4, Model evaluation
4.1 Evaluation: Chinese-Thai
# The checkpoint directory contains configuration parameters, optimizer parameters, and model parameters
!ls output/ckpt_zhth/epoch_final/
model.args model.pdopt model.pdparams
# Evaluate on the validation set
!python paddleseq_cli/generate.py -c examples/ikcest22/configs/zh_th.yaml --pretrained output/ckpt_zhth/epoch_final --test-pref datasets/bpe/zh_th/valid
2022-09-04 18:29:08 | INFO | paddleseq_cli.generate | Paddle BlEU Score:47.3688
Sacrebleu: BLEU = 47.81 70.5/55.5/49.6/47.4 (BP = 0.868 ratio = 0.876 hyp_len = 11084 ref_len = 12647)
write to file output/result.txt success.
# Average n checkpoints (n model_best_xx checkpoints are required)
!python scripts/average_checkpoints.py --inputs output/ckpt_zhth/ --output output/ckpt_zhth/avg2 --num-ckpts 2
!python paddleseq_cli/generate.py -c examples/ikcest22/configs/zh_th.yaml --pretrained output/ckpt_zhth/avg2 --test-pref datasets/bpe/zh_th/valid
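For intuition, checkpoint averaging simply takes the element-wise mean of the parameters of several saved models. Below is a minimal sketch of the idea; it is illustrative only, not the actual scripts/average_checkpoints.py implementation, and the paths in the comment are examples.

# Illustrative sketch of checkpoint averaging (not the repo's implementation)
import paddle

def average_params(param_paths):
    avg = None
    for path in param_paths:
        state = paddle.load(path)                                   # dict: name -> tensor
        arrs = {k: v.numpy().astype("float64") for k, v in state.items()}
        if avg is None:
            avg = arrs
        else:
            for k in avg:
                avg[k] += arrs[k]
    n = len(param_paths)
    return {k: paddle.to_tensor((v / n).astype("float32")) for k, v in avg.items()}

# Example (hypothetical paths):
# averaged = average_params(["output/ckpt_zhth/model_best_2.758/model.pdparams",
#                            "output/ckpt_zhth/model_best_1.239/model.pdparams"])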
4.2 Evaluation (all 6 directions)
# View the evaluation script and change the checkpoints as needed, e.g. replace output/ckpt_zhth/epoch_final with output/ckpt_zhth/model_best_xxx
!cat examples/ikcest22/scripts/evaluate_all.sh
directions=("zh_th" "th_zh" "zh_fr" "fr_zh" "zh_ru" "ru_zh") data_paths=("zh_th" "zh_th" "zh_fr" "zh_fr" "zh_ru" "zh_ru") ckpts=("output/ckpt_zhth/epoch_final" "output/ckpt_thzh/epoch_final" "output/ckpt_zhfr/epoch_final" "output/ckpt_frzh/epoch_final" "output/ckpt_zhru/epoch_final" "output/ckpt_ruzh/epoch_final") for ((i=0;i<${#directions[@]};i++)) do direct=${directions[$i]} ckpt=${ckpts[$i]} echo "------------------------------------------------------------evaluate ${direct}....------------------------------------------------------------" python paddleseq_cli/generate.py -c examples/ikcest22/configs/${direct}.yaml --pretrained $ckpt --test-pref datasets/bpe/${data_paths[$i]}/valid done echo "all done"
# 3min
!bash examples/ikcest22/scripts/evaluate_all.sh
----------------evaluate zh_th....----------------
2022-09-04 18:33:02 | INFO | paddleseq_cli.generate | Paddle BlEU Score:47.3688
Sacrebleu: BLEU = 47.81 70.5/55.5/49.6/47.4 (BP = 0.868 ratio = 0.876 hyp_len = 11084 ref_len = 12647)
write to file output/result.txt success.
----------------evaluate th_zh....----------------
2022-09-04 18:33:25 | INFO | paddleseq_cli.generate | Paddle BlEU Score:49.0684
Sacrebleu: BLEU = 49.46 63.7/49.7/45.1/43.4 (BP = 0.991 ratio = 0.991 hyp_len = 10534 ref_len = 10626)
----------------evaluate zh_fr....----------------
2022-09-04 18:33:55 | INFO | paddleseq_cli.generate | Paddle BlEU Score:14.9049
Sacrebleu: BLEU = 17.84 47.2/22.6/13.5/8.8 (BP = 0.946 ratio = 0.947 hyp_len = 21648 ref_len = 22854)
write to file output/result.txt success.
----------------evaluate fr_zh....----------------
2022-09-04 18:34:26 | INFO | paddleseq_cli.generate | Paddle BlEU Score:16.1178
Sacrebleu: BLEU = 16.33 48.4/20.4/10.9/6.6 (BP = 1.000 ratio = 1.006 hyp_len = 19290 ref_len = 19175)
write to file output/result.txt success.
----------------evaluate zh_ru....----------------
2022-09-04 18:35:08 | INFO | paddleseq_cli.generate | Paddle BlEU Score:10.7678
Sacrebleu: BLEU = 14.89 39.9/17.8/10.6/6.8 (BP = 0.992 ratio = 0.992 hyp_len = 24192 ref_len = 24381)
write to file output/result.txt success.
----------------evaluate ru_zh....----------------
2022-09-04 18:35:49 | INFO | paddleseq_cli.generate | Paddle BlEU Score:16.8321
Sacrebleu: BLEU = 16.99 48.1/21.0/11.6/7.1 (BP = 1.000 ratio = 1.000 hyp_len = 23760 ref_len = 23753)
write to file output/result.txt success.
all done
5, Prediction and submission
5.1 Chinese-Thai prediction
!python paddleseq_cli/generate.py -c examples/ikcest22/configs/zh_th.yaml --pretrained output/ckpt_zhth/epoch_final --only-src
!cat output/generate.txt | grep -P "^H" | sort -V | cut -f 3- > zh_th.rst
!head zh_th.rst
ผ้า กระเป๋า ดำ บน ผนัง ใน ทะเลทราย ฝัง กระดูก คลาสสิก และ เสน่ห์ อัน หรูหรา ปล่อย ให้ เห็น ถึง อารมณ์ ดีงาม ที่ เรียบง่าย เก็บ เอว ยก สะโพก หรูหรา เพชร มินิ ที่ ละเอียด 203 LV . 7 ดาว กองทหาร ฝันร้าย · ยอดเยี่ยม ทำให้ โล่ ความ ยาว 229 LV . 4 ดาว อาวุธ ม่วง ดร อป เค วส ขั้น 7 - แหวน ชื่อ : บริษัท เซี่ยงไฮ้ หัว เช่อผิ่น เปียว เท สติ้ง เทคโนโลยี จำกัด เฮ ก ซะ วา เลน ต์
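For clarity, the grep/sort/cut pipeline above just pulls the hypothesis lines out of output/generate.txt and writes them in sentence order. A rough Python equivalent follows, assuming fairseq-style lines of the form "H-<id>\t<score>\t<hypothesis>" (the exact file format is an assumption, not confirmed from the repo).

# Rough Python equivalent of: grep -P "^H" | sort -V | cut -f 3-
# Assumes lines like "H-<id>\t<score>\t<hypothesis>" (the format is an assumption).
def extract_hypotheses(path="output/generate.txt"):
    hyps = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("H"):
                tag, _score, text = line.rstrip("\n").split("\t", 2)
                hyps.append((int(tag.split("-")[1]), text))
    hyps.sort(key=lambda x: x[0])   # same ordering effect as `sort -V`
    return [text for _, text in hyps]

with open("zh_th.rst", "w", encoding="utf-8") as f:
    f.write("\n".join(extract_hypotheses()) + "\n")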
5.2 Prediction (all 6 directions) ⭐
The last checkpoint, epoch_final, is used by default; you can change it to model_best_xxx. The results are packaged directly into trans_result.zip, which you can submit as-is.
# The checkpoint with the best validation score can be used instead of epoch_final
!cat examples/ikcest22/scripts/generate_all.sh
directions=("zh_th" "th_zh" "zh_fr" "fr_zh" "zh_ru" "ru_zh") ckpts=("output/ckpt_zhth/epoch_final" "output/ckpt_thzh/epoch_final" "output/ckpt_zhfr/epoch_final" "output/ckpt_frzh/epoch_final" "output/ckpt_zhru/epoch_final" "output/ckpt_ruzh/epoch_final") for ((i=0;i<${#directions[@]};i++)) do direct=${directions[$i]} ckpt=${ckpts[$i]} echo "------------------------------------------------------------generate ${direct}....------------------------------------------------------------" python paddleseq_cli/generate.py -c examples/ikcest22/configs/${direct}.yaml --pretrained $ckpt --only-src cat output/generate.txt | grep -P "^H" | sort -V | cut -f 3- > ${direct}.rst done zip -r trans_result.zip *.rst echo "all done"
# 3min
!bash examples/ikcest22/scripts/generate_all.sh
----------------generate zh_th....----------------
write to file output/result.txt success.
----------------generate th_zh....----------------
write to file output/result.txt success.
----------------generate zh_fr....----------------
write to file output/result.txt success.
----------------generate fr_zh....----------------
write to file output/result.txt success.
----------------generate zh_ru....----------------
write to file output/result.txt success.
----------------generate ru_zh....----------------
2022-09-04 18:49:35 | INFO | paddleseq_cli.generate | configs:
write to file output/result.txt success.
adding: fr_zh.rst (deflated 59%)
adding: ru_zh.rst (deflated 61%)
adding: th_zh.rst (deflated 56%)
adding: zh_fr.rst (deflated 64%)
adding: zh_ru.rst (deflated 74%)
adding: zh_th.rst (deflated 98%)
all done
5.3 Submit results
Submission format:
1. Naming of translation result file
The result file for each translation direction must be named as follows:
French→Chinese: fr_zh.rst
Russian→Chinese: ru_zh.rst
Thai→Chinese: th_zh.rst
Chinese→French: zh_fr.rst
Chinese→Russian: zh_ru.rst
Chinese→Thai: zh_th.rst
2. Format of translation result file
The translation result file stores one target sentence per line.
3. Package and submit
Compress all translation result files into a single zip file. Refer to the following operations:
zip trans_result.zip *.rst
!ls ./trans_result*
./trans_result.zip
The generate_all.sh script automatically packages all the results; just submit the resulting trans_result.zip.
The project baseline:
Official baseline:
6, Summary
This project covers data processing, data loading, model training, evaluation, prediction, and one-click generation of the submission file, so everyone can get on the leaderboard quickly. For further optimization, you can start from the following points:
1. Data: ① the French and Russian text has not been processed at all here, so you can try tokenizing and truecasing it with Moses; ② use back-translation, self-training and similar methods to synthesize more data; ③ or use language models to mine data that is closer to the distribution of the IKCEST test set and train on it together with the original data; ④ simply add more data!
2. Model: try other architectures, such as transformer_big, or tune the hyper-parameters yourself;
3. Training: multilingual joint training, or pre-training the model yourself;
4. Loss: use SimCut, R-Drop and other regularization/augmentation methods;
5. Prediction: ① simply increase the beam size; ② or rerank the candidates produced by the forward model with a backward (reverse-direction) translation model and a language model, i.e. noisy channel reranking (a scoring sketch follows this list).
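For point 5②, a common way to combine the scores during reranking is the standard noisy-channel formulation, shown below as a general sketch (not a PaddleSeq API; the weights are placeholders to be tuned on the validation set).

# General noisy-channel reranking score for a candidate y of source x (not a PaddleSeq API)
def noisy_channel_score(logp_fwd, logp_bwd, logp_lm, lam1=1.0, lam2=0.3):
    # logp_fwd: log P(y|x) from the forward translation model
    # logp_bwd: log P(x|y) from the backward (reverse-direction) model
    # logp_lm : log P(y) from a target-side language model
    # lam1 / lam2: interpolation weights (placeholder values, tune on the validation set)
    return logp_fwd + lam1 * logp_bwd + lam2 * logp_lm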
If there are any problems in the project, please join the competition group to discuss and make progress together!
Finally, haha, future champions, please give the repo a star!
PaddleSeq
Code reference:
This article is a repost of the original project.
Original project link