Checklist
Description
I've been using AllenNLP since 2018, and I have run thousands of NER benchmarks with it, first with ELMo and later with transformers; its CrfTagger model has always yielded superior results in every benchmark I ran for this task. However, since my research group trained several RoBERTa models for Portuguese, we have been benchmarking them against an existing BERT model, and we are getting results that are inconsistent with those from other frameworks, such as Hugging Face's transformers.
Sorted results for the AllenNLP grid search on CoNLL 2003 using optuna (every BERT result is better than every RoBERTa result):

Sorted results for the Hugging Face transformers grid search on CoNLL 2003 (every RoBERTa result is better than every BERT result):

I originally opened this as a question on Stack Overflow, as suggested in the issue guidelines (additional details are already provided there), but I have not been able to find the problem myself. I ran several of AllenNLP's unit tests covering the tokenizers and embedders and couldn't spot anything wrong, but I suspect something is definitely off in the training process, since the results are so much worse for non-BERT models.
Although I'm reporting details against the current release version, I'd like to point out that I also ran this CoNLL 2003 benchmark with RoBERTa/AllenNLP a long time ago, so this is not new behavior. At the time, RoBERTa's results were well below bert-base's, but I simply assumed RoBERTa wasn't competitive for NER (which is not true at all).
The results obtained with AllenNLP are expected to be at least as good as those obtained with Hugging Face's framework.
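Since the gap only appears for non-BERT models, one place the two frameworks could silently disagree is how word-level NER labels are aligned to subword pieces (RoBERTa's BPE splits words differently from BERT's WordPiece). The sketch below illustrates the common "label the first subword, mask the rest" convention; the `subwords` mapping is a toy stand-in for a real tokenizer, and the function name is hypothetical, not either library's actual code.

```python
# Hypothetical sketch of the subword/label alignment both frameworks must agree on.
# The toy `subwords` dict stands in for a real BPE/WordPiece tokenizer; real
# tokenizers split differently, so this only illustrates the convention itself.

def align_labels(words, labels, subwords, ignore_index=-100):
    """Give each word's label to its first subword; mask the remaining pieces."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        for i, piece in enumerate(subwords[word]):
            pieces.append(piece)
            # Only the first piece keeps the gold label; the rest are ignored
            # by the loss (the usual -100 convention in PyTorch/transformers).
            piece_labels.append(label if i == 0 else ignore_index)
    return pieces, piece_labels

subwords = {"EU": ["EU"], "rejects": ["re", "jects"], "German": ["German"]}
words = ["EU", "rejects", "German"]
labels = ["B-ORG", "O", "B-MISC"]
print(align_labels(words, labels, subwords))
# → (['EU', 're', 'jects', 'German'], ['B-ORG', 'O', -100, 'B-MISC'])
```

If the alignment (or the metric's handling of masked pieces) differs between the two pipelines, a model can train fine in one framework and degrade in the other, which would match the symptom described above.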
Related issues or possible duplicates
Environment
OS: Linux
Python version: 3.8.13
Output of pip freeze:
aiohttp==3.8.1
aiosignal==1.2.0
alembic==1.8.1
allennlp==2.10.0
allennlp-models==2.10.0
allennlp-optuna==0.1.7
asttokens==2.0.8
async-timeout==4.0.2
attrs==21.2.0
autopage==0.5.1
backcall==0.2.0
base58==2.1.1
blis==0.7.8
bokeh==2.4.3
boto3==1.24.67
botocore==1.27.67
cached-path==1.1.5
cachetools==5.2.0
catalogue==2.0.8
certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi
charset-normalizer==2.1.1
click==8.1.3
cliff==4.0.0
cloudpickle==2.2.0
cmaes==0.8.2
cmd2==2.4.2
colorama==0.4.5
colorlog==6.7.0
commonmark==0.9.1
conllu==4.4.2
converters-datalawyer==0.1.10
cvxopt==1.2.7
cvxpy==1.2.1
cycler==0.11.0
cymem==2.0.6
Cython==0.29.32
datasets==2.4.0
debugpy==1.6.3
decorator==5.1.1
deprecation==2.1.0
dill==0.3.5.1
dkpro-cassis==0.7.2
docker-pycreds==0.4.0
ecos==2.0.10
elasticsearch==7.13.0
emoji==2.0.0
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl
entrypoints==0.4
executing==1.0.0
fairscale==0.4.6
filelock==3.7.1
fire==0.4.0
fonttools==4.37.1
frozenlist==1.3.1
fsspec==2022.8.2
ftfy==6.1.1
future==0.18.2
gensim==4.2.0
gitdb==4.0.9
GitPython==3.1.27
google-api-core==2.8.2
google-auth==2.11.0
google-cloud-core==2.3.2
google-cloud-storage==2.5.0
google-crc32c==1.5.0
google-resumable-media==2.3.3
googleapis-common-protos==1.56.4
greenlet==1.1.3
h5py==3.7.0
hdbscan==0.8.28
huggingface-hub==0.8.1
hyperopt==0.2.7
idna==3.3
importlib-metadata==4.12.0
importlib-resources==5.4.0
inceptalytics==0.1.0
iniconfig==1.1.1
ipykernel==6.15.2
ipython==8.5.0
jedi==0.18.1
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.1.0
jsonnet==0.18.0
jupyter-core==4.11.1
jupyter_client==7.3.5
kiwisolver==1.4.4
krippendorff==0.5.1
langcodes==3.3.0
llvmlite==0.39.1
lmdb==1.3.0
lxml==4.9.1
Mako==1.2.2
MarkupSafe==2.1.1
matplotlib==3.5.3
matplotlib-inline==0.1.6
more-itertools==8.12.0
multidict==6.0.2
multiprocess==0.70.13
murmurhash==1.0.8
nest-asyncio==1.5.5
networkx==2.8.6
nltk==3.7
numba==0.56.2
numpy==1.23.3
optuna==2.10.1
osqp==0.6.2.post5
overrides==6.2.0
packaging==21.3
pandas==1.4.4
parso==0.8.3
pathtools==0.1.2
pathy==0.6.2
pbr==5.10.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.2.0
pluggy==1.0.0
preshed==3.0.7
prettytable==3.4.1
promise==2.3
prompt-toolkit==3.0.31
protobuf==3.20.0
psutil==5.9.2
pt-core-news-sm @ https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.3.0/pt_core_news_sm-3.3.0-py3-none-any.whl
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
py-rouge==1.1
py4j==0.10.9.7
pyannote.core==4.5
pyannote.database==4.1.3
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycaprio==0.2.1
pydantic==1.8.2
pygamma-agreement==0.5.6
Pygments==2.13.0
pympi-ling==1.70.2
pyparsing==3.0.9
pyperclip==1.8.2
pytest==7.1.3
python-dateutil==2.8.2
pytz==2022.2.1
PyYAML==6.0
pyzmq==23.2.1
qdldl==0.1.5.post2
regex==2022.8.17
requests==2.28.1
requests-toolbelt==0.9.1
responses==0.18.0
rich==12.1.0
rsa==4.9
s3transfer==0.6.0
sacremoses==0.0.53
scikit-learn==1.1.2
scipy==1.9.1
scs==3.2.0
seaborn==0.12.0
sentence-transformers==2.2.2
sentencepiece==0.1.97
sentry-sdk==1.9.8
seqeval==1.2.2
setproctitle==1.3.2
shellingham==1.5.0
shortuuid==1.0.9
simplejson==3.17.6
six==1.16.0
sklearn==0.0
smart-open==5.2.1
smmap==5.0.0
sortedcontainers==2.4.0
spacy==3.3.1
spacy-legacy==3.0.10
spacy-loggers==1.0.3
split-datalawyer==0.1.80
SQLAlchemy==1.4.41
srsly==2.4.4
stack-data==0.5.0
stanza==1.4.0
stevedore==4.0.0
tensorboardX==2.5.1
termcolor==1.1.0
TextGrid==1.5
thinc==8.0.17
threadpoolctl==3.1.0
tokenizers==0.12.1
tomli==2.0.1
toposort==1.7
torch==1.13.0.dev20220911+cu117
torchvision==0.14.0.dev20220911+cu117
tornado==6.2
tqdm==4.64.1
traitlets==5.3.0
transformers==4.21.3
typer==0.4.2
typing_extensions==4.3.0
umap==0.1.1
Unidecode==1.3.4
urllib3==1.26.12
wandb==0.12.21
wasabi==0.10.1
wcwidth==0.2.5
word2number==1.1
xxhash==3.0.0
yarl==1.8.1
zipp==3.8.1
Steps to reproduce
I'm attaching some parameters I used for running the CoNLL 2003 grid search.
Example source:
export BATCH_SIZE=8
export EPOCHS=10
export gradient_accumulation_steps=4
export dropout=0.2
export weight_decay=0
export seed=42
allennlp tune \
optuna_conll2003.jsonnet \
optuna-grid-search-conll2003-hparams.json \
--optuna-param-path optuna-grid-search-conll2003.json \
--serialization-dir /models/conll2003/benchmark_allennlp \
--study-name benchmark-allennlp-models-conll2003 \
--metrics test_f1-measure-overall \
--direction maximize \
--skip-if-exists \
--n-trials $1
optuna_conll2003.jsonnet
optuna-grid-search-conll2003.json
optuna-grid-search-conll2003-hparams.json