Checklist
Description
I've been using AllenNLP since 2018, and I have run thousands of NER benchmarks with it, first with ELMo and later with transformers; its CrfTagger model has always yielded superior results in every benchmark I ran for this task. However, since my research group trained several RoBERTa models for Portuguese, we have been benchmarking them against an existing BERT model, and we are getting results that are inconsistent with those from other frameworks, such as Hugging Face's transformers.
Sorted results for the AllenNLP grid search on CoNLL 2003 using optuna (every BERT result is better than every RoBERTa result):

Sorted results for the Hugging Face transformers grid search on CoNLL 2003 (every RoBERTa result is better than every BERT result):

I originally opened this as a question on Stack Overflow, as suggested in the issue guidelines (additional details are already provided there), but I have not been able to find the problem myself. I ran several of AllenNLP's unit tests covering the tokenizers and embedders and couldn't spot anything wrong, but I suspect something is definitely off in the training process, since the results are so much worse for non-BERT models.
Although I'm reporting details against the current release version, I'd like to point out that I also ran this CoNLL 2003 benchmark with RoBERTa/AllenNLP a long time ago, so this is not new behavior. At the time, RoBERTa's results were well below bert-base's, but I simply assumed RoBERTa wasn't competitive for NER (which is not true at all).
The results obtained with AllenNLP are expected to be at least as good as those obtained with Hugging Face's framework.
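Since the gap only appears for non-BERT models, one place the two frameworks could silently disagree is how word-level NER labels are aligned to subword pieces (RoBERTa's BPE splits words differently from BERT's WordPiece). The sketch below illustrates the common "label the first subword, mask the rest" convention; the `subwords` mapping is a toy stand-in for a real tokenizer, and the function name is hypothetical, not either library's actual code.

```python
# Hypothetical sketch of the subword/label alignment both frameworks must agree on.
# The toy `subwords` dict stands in for a real BPE/WordPiece tokenizer; real
# tokenizers split differently, so this only illustrates the convention itself.

def align_labels(words, labels, subwords, ignore_index=-100):
    """Give each word's label to its first subword; mask the remaining pieces."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        for i, piece in enumerate(subwords[word]):
            pieces.append(piece)
            # Only the first piece keeps the gold label; the rest are ignored
            # by the loss (the usual -100 convention in PyTorch/transformers).
            piece_labels.append(label if i == 0 else ignore_index)
    return pieces, piece_labels

subwords = {"EU": ["EU"], "rejects": ["re", "jects"], "German": ["German"]}
words = ["EU", "rejects", "German"]
labels = ["B-ORG", "O", "B-MISC"]
print(align_labels(words, labels, subwords))
# → (['EU', 're', 'jects', 'German'], ['B-ORG', 'O', -100, 'B-MISC'])
```

If the alignment (or the metric's handling of masked pieces) differs between the two pipelines, a model can train fine in one framework and degrade in the other, which would match the symptom described above.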
Related issues or possible duplicates
Environment
OS: Linux
Python version: 3.8.13
Output of pip freeze:
aiohttp==3.8.1
aiosignal==1.2.0
alembic==1.8.1
allennlp==2.10.0
allennlp-models==2.10.0
allennlp-optuna==0.1.7
asttokens==2.0.8
async-timeout==4.0.2
attrs==21.2.0
autopage==0.5.1
backcall==0.2.0
base58==2.1.1
blis==0.7.8
bokeh==2.4.3
boto3==1.24.67
botocore==1.27.67
cached-path==1.1.5
cachetools==5.2.0
catalogue==2.0.8
certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi
charset-normalizer==2.1.1
click==8.1.3
cliff==4.0.0
cloudpickle==2.2.0
cmaes==0.8.2
cmd2==2.4.2
colorama==0.4.5
colorlog==6.7.0
commonmark==0.9.1
conllu==4.4.2
converters-datalawyer==0.1.10
cvxopt==1.2.7
cvxpy==1.2.1
cycler==0.11.0
cymem==2.0.6
Cython==0.29.32
datasets==2.4.0
debugpy==1.6.3
decorator==5.1.1
deprecation==2.1.0
dill==0.3.5.1
dkpro-cassis==0.7.2
docker-pycreds==0.4.0
ecos==2.0.10
elasticsearch==7.13.0
emoji==2.0.0
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl
entrypoints==0.4
executing==1.0.0
fairscale==0.4.6
filelock==3.7.1
fire==0.4.0
fonttools==4.37.1
frozenlist==1.3.1
fsspec==2022.8.2
ftfy==6.1.1
future==0.18.2
gensim==4.2.0
gitdb==4.0.9
GitPython==3.1.27
google-api-core==2.8.2
google-auth==2.11.0
google-cloud-core==2.3.2
google-cloud-storage==2.5.0
google-crc32c==1.5.0
google-resumable-media==2.3.3
googleapis-common-protos==1.56.4
greenlet==1.1.3
h5py==3.7.0
hdbscan==0.8.28
huggingface-hub==0.8.1
hyperopt==0.2.7
idna==3.3
importlib-metadata==4.12.0
importlib-resources==5.4.0
inceptalytics==0.1.0
iniconfig==1.1.1
ipykernel==6.15.2
ipython==8.5.0
jedi==0.18.1
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.1.0
jsonnet==0.18.0
jupyter-core==4.11.1
jupyter_client==7.3.5
kiwisolver==1.4.4
krippendorff==0.5.1
langcodes==3.3.0
llvmlite==0.39.1
lmdb==1.3.0
lxml==4.9.1
Mako==1.2.2
MarkupSafe==2.1.1
matplotlib==3.5.3
matplotlib-inline==0.1.6
more-itertools==8.12.0
multidict==6.0.2
multiprocess==0.70.13
murmurhash==1.0.8
nest-asyncio==1.5.5
networkx==2.8.6
nltk==3.7
numba==0.56.2
numpy==1.23.3
optuna==2.10.1
osqp==0.6.2.post5
overrides==6.2.0
packaging==21.3
pandas==1.4.4
parso==0.8.3
pathtools==0.1.2
pathy==0.6.2
pbr==5.10.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.2.0
pluggy==1.0.0
preshed==3.0.7
prettytable==3.4.1
promise==2.3
prompt-toolkit==3.0.31
protobuf==3.20.0
psutil==5.9.2
pt-core-news-sm @ https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.3.0/pt_core_news_sm-3.3.0-py3-none-any.whl
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
py-rouge==1.1
py4j==0.10.9.7
pyannote.core==4.5
pyannote.database==4.1.3
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycaprio==0.2.1
pydantic==1.8.2
pygamma-agreement==0.5.6
Pygments==2.13.0
pympi-ling==1.70.2
pyparsing==3.0.9
pyperclip==1.8.2
pytest==7.1.3
python-dateutil==2.8.2
pytz==2022.2.1
PyYAML==6.0
pyzmq==23.2.1
qdldl==0.1.5.post2
regex==2022.8.17
requests==2.28.1
requests-toolbelt==0.9.1
responses==0.18.0
rich==12.1.0
rsa==4.9
s3transfer==0.6.0
sacremoses==0.0.53
scikit-learn==1.1.2
scipy==1.9.1
scs==3.2.0
seaborn==0.12.0
sentence-transformers==2.2.2
sentencepiece==0.1.97
sentry-sdk==1.9.8
seqeval==1.2.2
setproctitle==1.3.2
shellingham==1.5.0
shortuuid==1.0.9
simplejson==3.17.6
six==1.16.0
sklearn==0.0
smart-open==5.2.1
smmap==5.0.0
sortedcontainers==2.4.0
spacy==3.3.1
spacy-legacy==3.0.10
spacy-loggers==1.0.3
split-datalawyer==0.1.80
SQLAlchemy==1.4.41
srsly==2.4.4
stack-data==0.5.0
stanza==1.4.0
stevedore==4.0.0
tensorboardX==2.5.1
termcolor==1.1.0
TextGrid==1.5
thinc==8.0.17
threadpoolctl==3.1.0
tokenizers==0.12.1
tomli==2.0.1
toposort==1.7
torch==1.13.0.dev20220911+cu117
torchvision==0.14.0.dev20220911+cu117
tornado==6.2
tqdm==4.64.1
traitlets==5.3.0
transformers==4.21.3
typer==0.4.2
typing_extensions==4.3.0
umap==0.1.1
Unidecode==1.3.4
urllib3==1.26.12
wandb==0.12.21
wasabi==0.10.1
wcwidth==0.2.5
word2number==1.1
xxhash==3.0.0
yarl==1.8.1
zipp==3.8.1
Steps to reproduce
I'm attaching some parameters I used for running the CoNLL 2003 grid search.
Example source:
export BATCH_SIZE=8
export EPOCHS=10
export gradient_accumulation_steps=4
export dropout=0.2
export weight_decay=0
export seed=42
allennlp tune \
optuna_conll2003.jsonnet \
optuna-grid-search-conll2003-hparams.json \
--optuna-param-path optuna-grid-search-conll2003.json \
--serialization-dir /models/conll2003/benchmark_allennlp \
--study-name benchmark-allennlp-models-conll2003 \
--metrics test_f1-measure-overall \
--direction maximize \
--skip-if-exists \
--n-trials $1
optuna_conll2003.jsonnet
optuna-grid-search-conll2003.json
optuna-grid-search-conll2003-hparams.json