☀️ Measuring the accuracy of BBC weather forecasts in Honolulu, USA

Overview

Accuracy of BBC Weather forecasts for Honolulu

This repository records the forecasts made by BBC Weather for the city of Honolulu, USA. Essentially, there's a GitHub Action that runs at each 30 minute mark and saves the latest forecasts. The data is stored in a separate branch called data. Therefore, the data is versioned. This allows going back into the past to see the forecasts that were made for any given hour in the (relative) future.

I made this after watching Git scraping, the five minute lightning talk by Simon Willison. It blew my mind! I agree with Simon that collecting and versioning API data via git is a powerful pattern. You could use this pattern to keep a ledger of any dynamic forecasting system, such as the predicted outcomes of football games. In these dynamical systems, the forecasts are updated when new information becomes available. Therefore, the forecasted values depend on the point in time when they were made. I think that it's super interesting to analyse how these forecasts evolve through time.

The build_database.py script iterates through all the commits in the data branch and consolidates the data into an SQLite database. You can run the script yourself by simply cloning this repository. Then, go into a terminal, navigate to the cloned repository, and install the necessary Python dependencies:

python -m venv .env
source .env/bin/activate
pip install -r requirements.txt

Then, run the consolidation script:

python build_database.py

This will create a bbc_weather.sqlite file. You can load the latter into your preferred database access tool — I have a personal preference for DataGrip — to analyse the data. At present, the database contains two tables:

forecasts

These are the predicted weather values made at one point in time for a future point in time.

issued_at at celsius feels_like_celsius wind_speed_kph
2021-03-10 09:00:00 2021-03-10 11:00:00 24 30 16
2021-03-10 09:00:00 2021-03-10 12:00:00 25 31 17
2021-03-10 09:00:00 2021-03-10 13:00:00 26 32 17
2021-03-10 09:00:00 2021-03-10 14:00:00 27 33 17
2021-03-10 09:00:00 2021-03-10 15:00:00 26 33 17

observations

These are the weather values that actually occurred — as opposed to those that were forecasted.

at celsius wind_speed_kph
2021-03-09 19:00:00 23 0
2021-03-09 20:00:00 22 8
2021-03-09 21:00:00 22 0
2021-03-09 22:00:00 21 9
2021-03-09 23:00:00 21 0

Check out measure_accuracy.sql for an example of how to evaluate the correctness of the forecasts.

Owner
Max Halford
Data wizard @alan-eu. PhD in machine learning applied to query optimization. Kaggle competitions Master when it was cool. Online machine learning nut.
Max Halford
Official Stanford NLP Python Library for Many Human Languages

Official Stanford NLP Python Library for Many Human Languages

Stanford NLP 6.4k Jan 02, 2023
GSoC'2021 | TensorFlow implementation of Wav2Vec2

GSoC'2021 | TensorFlow implementation of Wav2Vec2

Vasudev Gupta 73 Nov 28, 2022
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 01, 2022
An automated program that helps customers of Pizza Palour place their pizza orders

PIzza_Order_Assistant Introduction An automated program that helps customers of Pizza Palour place their pizza orders. The program uses voice commands

Tindi Sommers 1 Dec 26, 2021
Codes for processing meeting summarization datasets AMI and ICSI.

Meeting Summarization Dataset Meeting plays an essential part in our daily life, which allows us to share information and collaborate with others. Wit

xcfeng 39 Dec 14, 2022
XLNet: Generalized Autoregressive Pretraining for Language Understanding

Introduction XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective.

Zihang Dai 6k Jan 07, 2023
Code examples for my Write Better Python Code series on YouTube.

Write Better Python Code This repository contains the code examples used in my Write Better Python Code series published on YouTube: https:/

858 Dec 29, 2022
DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)

DANeS - Open-source E-newspaper dataset Source: Technology vector created by macrovector - www.freepik.com. DANeS is an open-source E-newspaper datase

DATASET .JSC 64 Aug 17, 2022
A BERT-based reverse dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈 : 모델링 / 데이터 수집 / 프로젝트 설계 / back-end 김종윤 : 데이터 수집 / 프로젝트 설계 / front-end / back-end 임용

94 Dec 08, 2022
Mednlp - Medical natural language parsing and utility library

Medical natural language parsing and utility library A natural language medical

Paul Landes 3 Aug 24, 2022
Milaan Parmar / Милан пармар / _米兰 帕尔马 170 Dec 13, 2022
A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Multilingual Latent Dirichlet Allocation (LDA) Pipeline This project is for text clustering using the Latent Dirichlet Allocation (LDA) algorithm. It

Artifici Online Services inc. 74 Oct 07, 2022
This is my reading list for my PhD in AI, NLP, Deep Learning and more.

This is my reading list for my PhD in AI, NLP, Deep Learning and more.

Zhong Peixiang 156 Dec 21, 2022
gaiic2021-track3-小布助手对话短文本语义匹配复赛rank3、决赛rank4

决赛答辩已经过去一段时间了,我们队伍ac milan最终获得了复赛第3,决赛第4的成绩。在此首先感谢一些队友的carry~ 经过2个多月的比赛,学习收获了很多,也认识了很多大佬,在这里记录一下自己的参赛体验和学习收获。

102 Dec 19, 2022
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.

ParlAI (pronounced “par-lay”) is a python framework for sharing, training and testing dialogue models, from open-domain chitchat, to task-oriented dia

Facebook Research 9.7k Jan 09, 2023
ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost LOVE is accpeted by ACL22 main conference as a long pape

Lihu Chen 32 Jan 03, 2023
CredData is a set of files including credentials in open source projects

CredData is a set of files including credentials in open source projects. CredData includes suspicious lines with manual review results and more information such as credential types for each suspicio

Samsung 19 Sep 07, 2022
Python port of Google's libphonenumber

phonenumbers Python Library This is a Python port of Google's libphonenumber library It supports Python 2.5-2.7 and Python 3.x (in the same codebase,

David Drysdale 3.1k Dec 29, 2022
This repository details the steps in creating a Part of Speech tagger using Trigram Hidden Markov Models and the Viterbi Algorithm without using external libraries.

POS-Tagger This repository details the creation of a Part-of-Speech tagger using Trigram Hidden Markov Models to predict word tags in a word sequence.

Raihan Ahmed 1 Dec 09, 2021
Creating a Feed of MISP Events from ThreatFox (by abuse.ch)

ThreatFox2Misp Creating a Feed of MISP Events from ThreatFox (by abuse.ch) What will it do? This will fetch IOCs from ThreatFox by Abuse.ch, convert t

17 Nov 22, 2022