Training


Train the classifier's entity extraction and entity similarity models using PyTorch.

First, activate the virtual environment as explained in the Getting Started section.

Train an Entity Extraction Model

Configure the necessary hyperparameters in the config.json file. The defaults are:

{
    "model_name": "bert-base-cased",
    "crf": false,
    "dataset_path": "tabiya/job_ner_dataset",   
    "label_list": ["O", "B-Skill", "B-Qualification", "I-Domain", "I-Experience", "I-Qualification", "B-Occupation", "B-Domain", "I-Occupation", "I-Skill", "B-Experience"],
    "model_max_length": 128,
    "batch_size": 32,
    "learning_rate": 1e-4,
    "epochs": 4,
    "weight_decay": 0.01,
    "save": false,
    "output_path": "bert_job_ner"
}
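For orientation, here is a minimal sketch of how train.py might consume these values with the standard HuggingFace APIs. The config loading and the mapping onto TrainingArguments shown here are assumptions for illustration; the actual script may differ.

import json

from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    TrainingArguments,
)

# Read the hyperparameters from config.json (defaults shown above).
with open("config.json") as f:
    config = json.load(f)

# Tokenizer and token-classification head sized to the BIO label list.
tokenizer = AutoTokenizer.from_pretrained(
    config["model_name"], model_max_length=config["model_max_length"]
)
model = AutoModelForTokenClassification.from_pretrained(
    config["model_name"], num_labels=len(config["label_list"])
)

# Map the remaining config values onto HuggingFace training arguments.
training_args = TrainingArguments(
    output_dir=config["output_path"],
    per_device_train_batch_size=config["batch_size"],
    learning_rate=config["learning_rate"],
    num_train_epochs=config["epochs"],
    weight_decay=config["weight_decay"],
)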

To train the model, run the following script in the train directory:

python train.py

Train an Entity Similarity Model

Configure the necessary hyperparameters in the sbert_train function in the sbert_train.py file:

sbert_train(model_id='all-MiniLM-L6-v2', dataset_path='your/dataset/path', output_path='your/output/path')

To train the similarity model, run the following script in the train directory:

python sbert_train.py
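If you want to adapt the similarity training to your own pipeline, a rough sketch of what a function like sbert_train could look like with the sentence-transformers library follows. This is a hypothetical implementation for illustration only; the actual code in sbert_train.py may differ.

import pandas as pd
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

def sbert_train(model_id, dataset_path, output_path):
    # Load the CSV of paired texts, e.g. 'title' and 'esco_label' columns.
    df = pd.read_csv(dataset_path)
    examples = [
        InputExample(texts=[row.title, row.esco_label])
        for row in df.itertuples()
    ]
    loader = DataLoader(examples, shuffle=True, batch_size=32)

    # Fine-tune the base sentence encoder on the positive pairs.
    model = SentenceTransformer(model_id)
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, output_path=output_path)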

The dataset should be formatted as a CSV file with two columns, such as 'title' and 'esco_label', where each row contains a pair of related texts to be used during training. Make sure there are no missing values in your dataset to ensure the model trains successfully. Here's an example of how your CSV file might look:

| title                   | esco_label                  |
| ----------------------- | --------------------------- |
| Senior Conflict Manager | public institution director |
| etc                     | etc                         |
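A quick pandas check along these lines can catch formatting problems before training. The column names match the example above, and the path is whatever you pass to sbert_train:

import pandas as pd

df = pd.read_csv("your/dataset/path")

# The file needs the two paired-text columns and no empty cells.
assert {"title", "esco_label"} <= set(df.columns), "missing expected columns"
assert not df.isnull().values.any(), "dataset contains missing values"
print(f"{len(df)} training pairs ready")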

The training script is based on the official HuggingFace token classification tutorial; more information can be found there.