2022 IKCEST Fourth "Belt and Road" International Big Data Competition: Paddle Baseline

Ikcest22 - French/Russian/Thai ↔ Chinese Translation Baseline

Preface

Seeing that plenty of people signed up but few have submitted, I cleaned up part of my code into a competition baseline so everyone can get on the scoreboard quickly. This project uses transformer_base, trained for 50 epochs on the six directions between Chinese and French, Russian, and Thai. I wish you all good results in the competition! And if you're a bro, come and outdo me!

Project address: PaddleSeq

Updates:

If you find a bug or anything awkward to use, corrections and criticism are welcome!

9/6:

1. Fixed the --arch loading bug; thanks to Three-years-old for the correction!

2. Following the suggestions of Three-years-old and Li-Yu-Listens-to-the-Moon, resuming training has been simplified: instead of specifying --resume plus the last epoch and last step, you now load the yaml in the weight directory directly, i.e. -c ckpt_dir/model.yaml; see 3.3 on resuming training for details.

3. Following Li-Yu-Listens-to-the-Moon's suggestion, a simplified, silky one-click workflow has been added:

1. Event introduction

2022 IKCEST's Fourth "Belt and Road" International Big Data Competition

This big data competition is organized under the guidance of the Chinese Academy of Engineering, the Ministry of Education's University Computer Course Teaching Steering Committee, and the University Alliance of the Silk Road, and is jointly hosted by the International Knowledge Centre for Engineering Sciences and Technology (IKCEST), the China Knowledge Centre for Engineering Sciences and Technology (CKCEST), Baidu, and Xi'an Jiaotong University. It focuses on countries along the "Belt and Road", aims to discover cutting-edge big-data and AI talent worldwide through competition, promotes joint government-industry-academia efforts to advance research, application, and development in the big data industry, further consolidates the event's theoretical and practical foundations, and accelerates the cultivation of top AI innovators.

2. Data processing

2.1 Environment installation

!unzip PaddleSeq.zip
# !git clone  https://github.com/MiuGod0126/PaddleSeq.git
%cd PaddleSeq
!pip install -r requirements.txt
!unzip ../nmt_data_tools.zip -d ./
# !git clone https://github.com/MiuGod0126/nmt_data_tools.git
!pip install -r nmt_data_tools/requirements.txt

2.2 Data processing

First segment the Chinese and Thai text into words (jieba / pythainlp), then apply subword-nmt BPE to all languages, and randomly split 1000 sentences from the training set as the validation set; the script takes about 7 minutes to run.
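For intuition, here is a minimal Python sketch of the segmentation and BPE steps the script performs; the sample strings and the codes-file path are illustrative assumptions, not taken from the script itself.

# Minimal sketch of the preprocessing steps (strings and paths are assumptions)
import jieba
from pythainlp.tokenize import word_tokenize
from subword_nmt.apply_bpe import BPE

print(" ".join(jieba.lcut("初临异界头像框")))          # Chinese word segmentation
print(" ".join(word_tokenize("ดาเมจเพิ่มขึ้น")))      # Thai word segmentation

with open("datasets/bpe/zh_th/code.zh", encoding="utf-8") as f:
    bpe = BPE(f)                                      # load learned BPE merge operations
print(bpe.process_line("初临 异界 头像 框"))           # apply subword segmentation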

!bash examples/ikcest22/scripts/prepare-ikcest22.sh
# Under datasets/, raw holds the original unprocessed text pairs; tmp holds the word-segmented and BPE intermediate files; the final training data is written to the bpe folder
!ls datasets/
!head -n 1 datasets/raw/zh_th/train.zh
!head -n 1 datasets/tmp/zh_th/train.tok.zh
!head -n 1 datasets/tmp/zh_th/train.bpe.zh
bpe  raw  tmp
 Initial alien head portrait frame
 Avatar frame of the first visit to the alien world
 first@@ Temporary dissimilarity@@ circles@@ Avatar frame
# The data format of this project: each language pair is split into two files named prefix.lang; valid is the 1000 randomly split sentences, and code holds the BPE merge operations.
!ls datasets/bpe/zh_th/
code.th  test.th_zh.th	train.th  valid.th  vocab.th
code.zh  test.zh_th.zh	train.zh  valid.zh  vocab.zh

2.3 Data loading

# Load the Chinese-Thai data
from paddleseq.reader import prep_dataset, prep_loader
from yacs.config import CfgNode
cfg_path="examples/ikcest22/configs/zh_th.yaml"
conf = CfgNode.load_cfg(open(cfg_path, encoding="utf-8"))
dataset_train = prep_dataset(conf, mode="train")
dataset_valid = prep_dataset(conf, mode="dev")

The configuration file:

SAVE: output
data:
  has_target: True
  lang_embed: False
  lazy_load: False
  pad_factor: 8
  pad_vocab: False
  special_token: ['<s>', '<pad>', '</s>', '<unk>']
  src_bpe_path: datasets/bpe/zh_th/code.zh
  src_lang: zh
  test_pref: datasets/bpe/zh_th/test.zh_th
  tgt_bpe_path: datasets/bpe/zh_th/code.th
  tgt_lang: th
  train_pref: datasets/bpe/zh_th/train
  truecase_path: None
  use_binary: False
  use_moses_bpe: False
  valid_pref: datasets/bpe/zh_th/valid
  vocab_pref: datasets/bpe/zh_th/vocab
 ......
# Print 5 valid entries
for data in dataset_valid[:5]:
    print(data)
{'id': 0, 'src': 'During the transformation BOSS Damage increased by 7%', 'tgt': 'ใน ช่วง กลายร่าง ดา เม จ ที่ ทำ ใส่ BOSS ทั้งหมด เพิ่มขึ้น 7 %'}
{'id': 1, 'src': 'I'm getting bored, rats! Try it. It's from Kan@@ Tru's highest scientific power!', 'tgt': 'ข้า เริ่ม รำ คาน แล้ว นะ พวก หนู ! มา ลอง พลัง แห่ง เทคโนโลยี สูงสุด ของ คัท@@ ลู@@ !'}
{'id': 2, 'src': 'And the back is with shoulder@@ The design of the line, which is lazy and a little neat, has made a super long model, which looks slender.', 'tgt': 'การ ออกแบบ ด้านหลัง ที่ มี เส้น ไหล่ ภายใน ความ@@ ขี้เกียจ ติด ความ เรียบร้อย บ้าง ทำเป็น แบบ ยาว พิเศษ ทำให้ เส้น โครงร่าง ของ คน เรียว ยาว'}
{'id': 3, 'src': 'Cause magic damage to all enemies' targets. If the target is less than 5 people, the damage will be increased by 10 for each reduction% , At the same time, the critical hit probability decreases by 22% , It lasts for 2 rounds; If the target is a warrior class, dispel all its gain states.', 'tgt': 'ทำ M . DMG ใส่ ศัตรู ทั้งหมด เมื่อ เป้าหมาย น้อยกว่า 5 คน ทุกครั้งที่ ลด 1 คน DMG เพิ่ม 10 % ขณะเดียวกัน อัตรา CRIT ลด 22 % ต่อเนื่อง 2 รอบ ; หาก เป้าหมาย เป็น นักรบ ขับไล่ บัฟ ทั้งหมด ของ เป้าหมาย'}
{'id': 4, 'src': 'Cook@@ Cooking and preparation@@ food', 'tgt': 'ปรุง@@ อาหาร และ เตรียม อาหาร'}
# tokens to ids, group batch
train_loader = prep_loader(conf, dataset_train, mode="train" ,multi_process=False)
valid_loader = prep_loader(conf, dataset_valid, mode="dev",multi_process=False)
for batch_idx,batch_data in enumerate(valid_loader):
    samples_id, src_tokens, prev_tokens, tgt_tokens = batch_data
    print(f"samples_id:{samples_id.shape} , src_tokens:{src_tokens.shape}, prev_tokens:{prev_tokens.shape}, tgt_tokens:{tgt_tokens.shape}")
    if batch_idx>4:
        break
samples_id:[312] , src_tokens:[312, 6], prev_tokens:[312, 13], tgt_tokens:[312, 13, 1]
samples_id:[240] , src_tokens:[240, 8], prev_tokens:[240, 16], tgt_tokens:[240, 16, 1]
samples_id:[184] , src_tokens:[184, 13], prev_tokens:[184, 22], tgt_tokens:[184, 22, 1]
samples_id:[120] , src_tokens:[120, 20], prev_tokens:[120, 31], tgt_tokens:[120, 31, 1]
samples_id:[88] , src_tokens:[88, 36], prev_tokens:[88, 42], tgt_tokens:[88, 42, 1]
samples_id:[48] , src_tokens:[48, 63], prev_tokens:[48, 77], tgt_tokens:[48, 77, 1]
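Note that prev_tokens is the target sequence shifted right for teacher forcing, while tgt_tokens holds the gold labels the model must predict. A toy illustration (not project code; the special-token ids are assumptions):

# Toy illustration of the decoder input/label shift (ids are assumed)
bos, eos = 0, 2                    # assumed <s> and </s> ids
tgt = [15, 27, 42, eos]            # a tokenized target sentence
prev_tokens = [bos] + tgt[:-1]     # decoder input:  <s> 15 27 42
tgt_tokens = tgt                   # decoder labels: 15 27 42 </s>
print(prev_tokens, tgt_tokens)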

3. Model training

3.1 Model architecture

Transformer, introduced in the paper Attention Is All You Need, is a network architecture for sequence-to-sequence (Seq2Seq) learning tasks such as machine translation; it models sequence-to-sequence relationships entirely with the attention mechanism.
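As a reminder of the core operation, here is a minimal sketch of scaled dot-product attention in paddle; it is illustrative only and not the project's own implementation.

import paddle
import paddle.nn.functional as F

# Minimal sketch of scaled dot-product attention (illustrative only)
def attention(q, k, v):
    d_k = q.shape[-1]
    scores = paddle.matmul(q, k, transpose_y=True) / d_k ** 0.5  # [batch, len_q, len_k]
    return paddle.matmul(F.softmax(scores, axis=-1), v)          # weighted sum of values

q = k = v = paddle.randn([2, 5, 64])  # toy (batch, length, d_model) tensors
print(attention(q, k, v).shape)       # [2, 5, 64]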

This project uses the transformer_base structure by default. To switch to transformer_big, modify model_name in the yaml configuration file:

model:
  model_name: transformer_big

Alternatively, you can specify the architecture on the command line, for example:

python paddleseq_cli/train.py -c examples/ikcest22/configs/zh_th.yaml --arch transformer_big

If you want to change the network structure or its parameters, add a new architecture in the source file transformer.py, for example:

def transformer_big(is_test=False, pretrained_path=None, **kwargs):
    for cfg in cfgs: assert cfg in kwargs, f'missing argument:{cfg}'
    model_args = dict(encoder_layers=6,
                      decoder_layers=6,
                      d_model=1024,
                      nheads=16,
                      dim_feedforward=4096,
                      **kwargs)
    model_args = base_architecture(model_args)
    model = _create_transformer('transformer_big', is_test, pretrained_path, model_args)
    return model

3.2 Model loading

from paddleseq.models import build_model
model = build_model(conf, is_test=False)
print(model)
Running time: 4 seconds 585 milliseconds
TRAIN model transformer_base created!
Transformer(
  (encoder): Encoder(
    (layers): LayerList(
      (0): EncoderLayer(
        (self_attn): MultiHeadAttentionWithInit(
          (q_proj): Linear(in_features=512, out_features=512, dtype=float32)
          (k_proj): Linear(in_features=512, out_features=512, dtype=float32)
          (v_proj): Linear(in_features=512, out_features=512, dtype=float32)
          (out_proj): Linear(in_features=512, out_features=512, dtype=float32)
        )
        (norm1): LayerNorm(normalized_shape=[512], epsilon=1e-05)
        (norm2): LayerNorm(normalized_shape=[512], epsilon=1e-05)
        (dropout1): Dropout(p=0.3, axis=None, mode=upscale_in_train)
        (dropout2): Dropout(p=0.3, axis=None, mode=upscale_in_train)
        (mlp): Mlp(
          (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
          (linear1): Linear(in_features=512, out_features=2048, dtype=float32)
          (linear2): Linear(in_features=2048, out_features=512, dtype=float32)
# Alternatively, you can load a model from the model.yaml in a weight directory (change ckpt_path to your own weight directory)
# from paddleseq.models import build_model
# ckpt_path="output/ckpt_zhth/epoch_final/"
# model = build_model(ckpt_path, is_test=False)
# print(model)
TRAIN model transformer_base created!
Pretrained weight load from:output/ckpt_zhth/epoch_final/model.pdparams!

3.3 Chinese-Thai training

This section uses the Chinese-Thai direction as an example to introduce paddleseq's training commands.

Note: if you just want to submit quickly, you can skip straight to 3.4 and run it as-is.

# View the configuration files
!ls examples/ikcest22/configs/
fr_zh.yaml  ru_zh.yaml	th_zh.yaml  zh_fr.yaml	zh_ru.yaml  zh_th.yaml
# Train Chinese-Thai for 3 epochs
# !export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'  # suppresses some warning messages
!python paddleseq_cli/train.py -c examples/ikcest22/configs/zh_th.yaml --update-freq 4  --max-epoch 3
2022-09-04 18:39:24,058 | transformer_base_rank0: ----- Total of train set:99000 ,train batch: 407 [single gpu]
2022-09-04 18:39:24,104 | transformer_base_rank0: ----- Total of valid set:1000 ,valid batch: 7 [single gpu]
2022-09-04 18:39:24,105 | transformer_base_rank0: Load data cost 1.7559177875518799 seconds.
2022-09-04 18:40:44,284 | transformer_base_rank0: Now training epoch 2. LR=0.00005
2022-09-04 18:41:01,809 | transformer_base_rank0: Train| epoch:[2/3], step:[100/395], speed:5.71 step/s, loss:11.520, nll_loss:11.148, ppl:2269.24, bsz:1001.9, gnorm:3.339, num_updates:123, lr:0.000063212
2022-09-04 18:41:15,831 | transformer_base_rank0: Train| epoch:[2/3], step:[200/391], speed:7.13 step/s, loss:11.307, nll_loss:10.899, ppl:1909.37, bsz:1011.2, gnorm:3.999, num_updates:147, lr:0.000075710
2022-09-04 18:41:30,304 | transformer_base_rank0: Train| epoch:[2/3], step:[300/404], speed:6.91 step/s, loss:11.169, nll_loss:10.731, ppl:1700.19, bsz:979.0, gnorm:4.359, num_updates:176, lr:0.000088207
2022-09-04 18:41:44,636 | transformer_base_rank0: Train| epoch:[2/3], step:[400/405], speed:6.98 step/s, loss:11.061, nll_loss:10.599, ppl:1551.27, bsz:977.0, gnorm:4.157, num_updates:201, lr:0.000100705
100%|█████████████████████████████████████████████| 7/7 [00:07<00:00,  1.07it/s]
2022-09-04 18:41:54,192 | transformer_base_rank0: Eval | Avg loss: 10.350 | nll_loss:9.716 | ppl: 900.017 | Eval | BLEU Score: 1.239
current checkpoints: ['model_best_0.0', 'epoch_final', 'model_best_1.239']
2022-09-04 18:42:01,859 | transformer_base_rank0: Epoch:[2] | Best Valid Bleu: [1.239] saved to output/ckpt_zhth/model_best_1.239!
2022-09-04 18:42:01,859 | transformer_base_rank0: Now training epoch 3. LR=0.00010
2022-09-04 18:42:18,927 | transformer_base_rank0: Train| epoch:[3/3], step:[100/402], speed:5.86 step/s, loss:10.527, nll_loss:9.950, ppl:989.44, bsz:983.4, gnorm:3.806, num_updates:226, lr:0.000114077
2022-09-04 18:42:33,549 | transformer_base_rank0: Train| epoch:[3/3], step:[200/431], speed:6.84 step/s, loss:10.486, nll_loss:9.902, ppl:957.00, bsz:917.0, gnorm:3.867, num_updates:265, lr:0.000126575
2022-09-04 18:42:48,090 | transformer_base_rank0: Train| epoch:[3/3], step:[300/416], speed:6.88 step/s, loss:10.350, nll_loss:9.747, ppl:859.50, bsz:951.3, gnorm:4.090, num_updates:283, lr:0.000139072
2022-09-04 18:43:03,210 | transformer_base_rank0: Train| epoch:[3/3], step:[400/403], speed:6.61 step/s, loss:10.284, nll_loss:9.671, ppl:815.06, bsz:980.2, gnorm:4.266, num_updates:301, lr:0.000151570
100%|█████████████████████████████████████████████| 7/7 [00:08<00:00,  1.09s/it]
2022-09-04 18:43:16,024 | transformer_base_rank0: Epoch:[3] | Best Valid Bleu: [2.758] saved to output/ckpt_zhth/model_best_2.758!
current checkpoints: ['model_best_0.0', 'epoch_final', 'model_best_1.239', 'model_best_2.758']

Introduction to training parameters:

python paddleseq_cli/train.py -c xx.yaml \
        --amp --ngpus 1 --update-freq 4 \
        --max-epoch 50 --save-epoch 10 --save-dir output \
        --pretrained ckpt --log-steps 100 --max-tokens 4096  \
        --seed 1 --eval-beam
        
-c:  Configuration file path;

--amp:  mixed-precision training, which speeds training up. (After enabling amp it is normal for the gradient norm gnorm in the log to become very large, because it is multiplied by the loss scale);

--ngpus:  number of GPUs to use;

--update-freq:  gradient accumulation, which simulates a larger batch_size and thus can yield better scores (when set to 4, in the log num_updates = step/4 and bsz = bsz*4); see the sketch after this list;

--max-epoch:  number of training epochs;

--save-epoch:  interval, in epochs, between checkpoint saves;

--save-dir:  directory for model weights, logs, visualization logs, and prediction output;

--pretrained:  pretrained weights to load; this is a directory, usually under output/ckpt, containing model.pdparams, model.pdopt, and model.yaml;

--log-steps:  how often to print logs;

--max-tokens:  maximum number of tokens per batch (the maximum token count on either the source or target side);

--seed:  random seed;

--eval-beam:  defaults to False; when specified, beam search is used during in-training evaluation to generate predictions and compute the BLEU score, which is more accurate than the default teacher-forcing argmax output.
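As promised above, a minimal sketch of the gradient-accumulation idea behind --update-freq; model, criterion, optimizer, and train_loader are assumed to be already built, so this is a sketch rather than the project's actual training loop.

# Sketch of gradient accumulation (assumes model/criterion/optimizer/train_loader exist)
update_freq = 4
for step, batch in enumerate(train_loader):
    loss = criterion(model(batch))
    (loss / update_freq).backward()      # scale so accumulated grads match one big batch
    if (step + 1) % update_freq == 0:
        optimizer.step()                 # one parameter update per update_freq micro-batches
        optimizer.clear_grad()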

On resuming training:

Sometimes you inevitably have to stop training and restart it later. In that case you can resume training, i.e. reload the previously trained model weights and optimizer state, by loading the model.yaml from the weight directory directly:

ckpt_dir=output/ckpt_zhth/epoch_final
python paddleseq_cli/train.py -c $ckpt_dir/model.yaml 

For other parameters, see config.py. Most of them mirror the ones in the yaml file, and command-line parameters override the yaml values.

3.4 Training all 6 directions ⭐

Run the training for all six directions in one click and rack up points fast!

# View the training script and modify the corresponding parameters by yourself
!cat examples/ikcest22/scripts/train_all.sh
epochs=50
freq=4 # update frequence

directions=("zh_th" "th_zh" "zh_fr" "fr_zh" "zh_ru" "ru_zh")
export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'
for direct in ${directions[@]}
  do
      echo "------------------------------------------------------------training ${direct}....------------------------------------------------------------"
      python paddleseq_cli/train.py -c examples/ikcest22/configs/${direct}.yaml --update-freq $freq --max-epoch $epochs
  done


echo "all done"
# ≈10h
!bash examples/ikcest22/scripts/train_all.sh
.----------------training zh_th.....----------------
INFO 2022-09-04 18:24:46,576 cloud_utils.py:122] get cluster from args:job_server:None pods:['rank:0 id:None addr:127.0.0.1 port:None visible_gpu:[] trainers:["gpu:[\'0\'] endpoint:127.0.0.1:58733 rank:0"]'] job_stage_flag:None hdfs:None
........................
2022-09-04 18:24:52,306 | transformer_base_rank0: Now training epoch 1. LR=0.00000
2022-09-04 18:25:11,480 | transformer_base_rank0: Train| epoch:[1/50], step:[100/419], speed:5.22 step/s, loss:14.192, nll_loss:14.134, ppl:17973.68, bsz:944.0, gnorm:9.323, num_updates:25, lr:0.000012348
.----------------training th_zh.....----------------
...............................

4. Model evaluation

4.1 Evaluating Chinese-Thai

# The weight directory contains configuration parameters, optimizer parameters, and model parameters
# !ls output/ckpt_zhth/epoch_final/
model.args  model.pdopt  model.pdparams
# Evaluate on the validation set
!python paddleseq_cli/generate.py -c examples/ikcest22/configs/zh_th.yaml --pretrained output/ckpt_zhth/epoch_final --test-pref datasets/bpe/zh_th/valid
2022-09-04 18:29:08 | INFO | paddleseq_cli.generate | Paddle BlEU Score:47.3688
Sacrebleu: BLEU = 47.81 70.5/55.5/49.6/47.4 (BP = 0.868 ratio = 0.876 hyp_len = 11084 ref_len = 12647)
write to file output/result.txt success.
# Average n checkpoints (requires n model_best_xx directories)
!python scripts/average_checkpoints.py --inputs output/ckpt_zhth/ --output output/ckpt_zhth/avg2 --num-ckpts  2
!python paddleseq_cli/generate.py -c examples/ikcest22/configs/zh_th.yaml --pretrained output/ckpt_zhth/avg2  --test-pref datasets/bpe/zh_th/valid
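Checkpoint averaging simply averages the parameter tensors of several checkpoints, which often gives a small BLEU boost. A minimal sketch of the idea; the checkpoint paths below are assumptions based on the training log above, not guaranteed to exist.

import os
import paddle

# Sketch of checkpoint averaging; paths are illustrative assumptions
paths = ["output/ckpt_zhth/model_best_2.758/model.pdparams",
         "output/ckpt_zhth/model_best_1.239/model.pdparams"]
states = [paddle.load(p) for p in paths]
avg = {k: sum(s[k] for s in states) / len(states) for k in states[0]}
os.makedirs("output/ckpt_zhth/avg2", exist_ok=True)
paddle.save(avg, "output/ckpt_zhth/avg2/model.pdparams")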

4.2 Evaluating all 6 directions

# View the evaluation script and modify the weight paths as needed, e.g. replace output/ckpt_zhth/epoch_final with output/ckpt_zhth/model_best_xxx
!cat examples/ikcest22/scripts/evaluate_all.sh
directions=("zh_th" "th_zh" "zh_fr" "fr_zh" "zh_ru" "ru_zh")
data_paths=("zh_th" "zh_th" "zh_fr" "zh_fr" "zh_ru" "zh_ru")
ckpts=("output/ckpt_zhth/epoch_final"
        "output/ckpt_thzh/epoch_final"
        "output/ckpt_zhfr/epoch_final"
        "output/ckpt_frzh/epoch_final"
        "output/ckpt_zhru/epoch_final"
        "output/ckpt_ruzh/epoch_final")

for ((i=0;i<${#directions[@]};i++))
  do  
      direct=${directions[$i]}
      ckpt=${ckpts[$i]}
      echo "------------------------------------------------------------evaluate ${direct}....------------------------------------------------------------"
      python paddleseq_cli/generate.py -c examples/ikcest22/configs/${direct}.yaml --pretrained $ckpt --test-pref datasets/bpe/${data_paths[$i]}/valid
  done

echo "all done"
# 3min
!bash examples/ikcest22/scripts/evaluate_all.sh
.----------------evaluate zh_th.....----------------
2022-09-04 18:33:02 | INFO | paddleseq_cli.generate | Paddle BlEU Score:47.3688
Sacrebleu: BLEU = 47.81 70.5/55.5/49.6/47.4 (BP = 0.868 ratio = 0.876 hyp_len = 11084 ref_len = 12647)
write to file output/result.txt success.
.-----------------evaluate th_zh.....----------------
2022-09-04 18:33:25 | INFO | paddleseq_cli.generate | Paddle BlEU Score:49.0684
Sacrebleu: BLEU = 49.46 63.7/49.7/45.1/43.4 (BP = 0.991 ratio = 0.991 hyp_len = 10534 ref_len = 10626)
.----------------evaluate zh_fr....----------------
2022-09-04 18:33:55 | INFO | paddleseq_cli.generate | Paddle BlEU Score:14.9049
Sacrebleu: BLEU = 17.84 47.2/22.6/13.5/8.8 (BP = 0.946 ratio = 0.947 hyp_len = 21648 ref_len = 22854)
write to file output/result.txt success.
.----------------evaluate fr_zh.....----------------
2022-09-04 18:34:26 | INFO | paddleseq_cli.generate | Paddle BlEU Score:16.1178
Sacrebleu: BLEU = 16.33 48.4/20.4/10.9/6.6 (BP = 1.000 ratio = 1.006 hyp_len = 19290 ref_len = 19175)
write to file output/result.txt success.
.----------------evaluate zh_ru.....----------------
2022-09-04 18:35:08 | INFO | paddleseq_cli.generate | Paddle BlEU Score:10.7678
Sacrebleu: BLEU = 14.89 39.9/17.8/10.6/6.8 (BP = 0.992 ratio = 0.992 hyp_len = 24192 ref_len = 24381)
write to file output/result.txt success.
.----------------evaluate ru_zh....----------------
2022-09-04 18:35:49 | INFO | paddleseq_cli.generate | Paddle BlEU Score:16.8321
Sacrebleu: BLEU = 16.99 48.1/21.0/11.6/7.1 (BP = 1.000 ratio = 1.000 hyp_len = 23760 ref_len = 23753)
write to file output/result.txt success.
all done

5. Prediction and submission

5.1 Chinese-Thai prediction

!python paddleseq_cli/generate.py -c examples/ikcest22/configs/zh_th.yaml --pretrained output/ckpt_zhth/epoch_final --only-src
!cat output/generate.txt | grep -P "^H" | sort -V | cut -f 3- > zh_th.rst
!head zh_th.rst
ผ้า กระเป๋า ดำ บน ผนัง ใน ทะเลทราย ฝัง กระดูก คลาสสิก และ เสน่ห์ อัน หรูหรา ปล่อย ให้ เห็น ถึง อารมณ์ ดีงาม ที่ เรียบง่าย
เก็บ เอว ยก สะโพก หรูหรา
เพชร มินิ ที่ ละเอียด
203 LV . 7 ดาว
กองทหาร ฝันร้าย · ยอดเยี่ยม
ทำให้ โล่ ความ ยาว
229 LV . 4 ดาว
อาวุธ ม่วง ดร อป เค วส ขั้น 7 - แหวน
ชื่อ : บริษัท เซี่ยงไฮ้ หัว เช่อผิ่น เปียว เท สติ้ง เทคโนโลยี จำกัด
เฮ ก ซะ วา เลน ต์
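For reference, a hypothetical Python equivalent of the grep | sort | cut pipeline above, assuming fairseq-style hypothesis lines of the form H-<id>\t<score>\t<text> in output/generate.txt:

# Hypothetical Python version of the hypothesis-extraction pipeline above
with open("output/generate.txt", encoding="utf-8") as f:
    hyps = [l.rstrip("\n").split("\t", 2) for l in f if l.startswith("H")]
hyps.sort(key=lambda p: int(p[0].split("-")[1]))  # restore original sentence order
with open("zh_th.rst", "w", encoding="utf-8") as f:
    f.writelines(text + "\n" for _, _, text in hyps)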

5.2 Predicting all 6 directions ⭐

By default the last checkpoint, epoch_final, is used; you can change it to model_best. The results are packaged directly into trans_result.zip, ready to submit.

# The best evaluated weight on the validation set can be used instead of epoch_final
!cat examples/ikcest22/scripts/generate_all.sh
directions=("zh_th" "th_zh" "zh_fr" "fr_zh" "zh_ru" "ru_zh")
ckpts=("output/ckpt_zhth/epoch_final"
        "output/ckpt_thzh/epoch_final"
        "output/ckpt_zhfr/epoch_final"
        "output/ckpt_frzh/epoch_final"
        "output/ckpt_zhru/epoch_final"
        "output/ckpt_ruzh/epoch_final")

for ((i=0;i<${#directions[@]};i++))
  do  
      direct=${directions[$i]}
      ckpt=${ckpts[$i]}
      echo "------------------------------------------------------------generate ${direct}....------------------------------------------------------------"
      python paddleseq_cli/generate.py -c examples/ikcest22/configs/${direct}.yaml --pretrained $ckpt --only-src
      cat output/generate.txt | grep -P "^H" | sort -V | cut -f 3- > ${direct}.rst
  done

zip -r trans_result.zip *.rst

echo "all done"
# 3min
!bash examples/ikcest22/scripts/generate_all.sh
.----------------generate zh_th....----------------
write to file output/result.txt success.
.----------------generate th_zh.....----------------
write to file output/result.txt success.
.----------------generate zh_fr.....----------------
write to file output/result.txt success.
.----------------generate fr_zh.....----------------
write to file output/result.txt success.
.----------------generate zh_ru.....----------------
write to file output/result.txt success.
.----------------generate ru_zh.....----------------
2022-09-04 18:49:35 | INFO | paddleseq_cli.generate | configs:
write to file output/result.txt success.
  adding: fr_zh.rst (deflated 59%)
  adding: ru_zh.rst (deflated 61%)
  adding: th_zh.rst (deflated 56%)
  adding: zh_fr.rst (deflated 64%)
  adding: zh_ru.rst (deflated 74%)
  adding: zh_th.rst (deflated 98%)
all done

5.3 Submitting the results

Submission format:

1. Naming of translation result file

The result files of each translation direction shall be named as specified.

French→Chinese direction: fr_zh.rst

Russian→Chinese direction: ru_zh.rst

Thai→Chinese direction: th_zh.rst

Chinese→French direction: zh_fr.rst

Chinese→Russian direction: zh_ru.rst

Chinese→Thai direction: zh_th.rst

2. Format of translation result file

Each translation result file stores one target sentence per line.

3. Package and submit

Compress all translation result files into a single zip file. Refer to the following operations:

zip trans_result.zip *.rst
!ls ./trans_result*
./trans_result.zip


The generate_all.sh script packages all the results automatically; just submit the resulting trans_result.zip.

Submission address

The project baseline:

Official baseline:

6. Summary

This project walks through data processing, data loading, model training, evaluation, prediction, and one-click generation of the submission file, so everyone can get on the leaderboard quickly. For further improvement, consider the following:

1. Data: ① the French and Russian texts received no processing at all, so try tokenizing and truecasing them with moses; ② synthesize more data with back-translation, self-training, and similar methods; ③ or use language models to mine data closer to the ikcest22 distribution and train on it together with the original data; ④ more data!

2. Model: try other structures, such as transformer_big, or change the parameters yourself;

3. Training: multilingual training, or do your own pretraining;

4. Loss: use SimCut, R-Drop, and other augmentation/regularization methods;

5. Prediction: ① simply adjust the beam size; ② or apply noisy channel reranking, i.e. rerank the multiple candidates produced by the forward model using a backward translation model and a language model.

If you run into any problems with the project, please join the competition group to discuss and improve together!

Finally, haha, all you future champions, give the repo a star!

PaddleSeq

Code reference:

1.PaddleNLP

2.fairseq

3.ConvS2S_Paddle

4.STACL_Paddle
