fedora-l10n / fedora-localization-statistics

Files

Commit: 9b92ae9caeeb181b84818ac8847d2873330100b1

	CLDR-raw
	debug
	docker
	geo_data
	templates
	website
	.gitignore
	README.md
	nb_lang.png
	requirements.txt
	todo.md
	words_coverage.png
	build.py
	build_language_list.py
	build_map.py
	build_stats.py
	build_tm.py
	build_website.py
	convertCSV.py
	extract_srpm.sh
	runall.sh

README.md

fedora-localization-statistics

Global statistics on translation levels of Fedora products

Requirements

dnf install podman

Create needed container images

Each release needs its own image.

podman build . -f docker/Dockerfile.$release -t fedlocstats:$release

podman build . -f docker/Dockerfile.31 -t fedlocstats:31
podman build . -f docker/Dockerfile.32 -t fedlocstats:32
podman build . -f docker/Dockerfile.33 -t fedlocstats:33
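
The three build commands above follow one pattern, so they can be generated for any list of releases. The following is a minimal sketch, not part of the repository: the helper name `build_command` is hypothetical, and it only assembles the same `podman build` arguments shown above.

```python
import shlex


def build_command(release):
    """Assemble the podman build command for one Fedora release (hypothetical helper)."""
    return [
        "podman", "build", ".",
        "-f", f"docker/Dockerfile.{release}",
        "-t", f"fedlocstats:{release}",
    ]


# Print the commands; pass each list to subprocess.run(cmd, check=True) to execute them.
for release in ("31", "32", "33"):
    print(shlex.join(build_command(release)))
```

Printing the commands first makes it easy to check them before running anything against podman.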

Run the scripts

podman run -it --rm -v ./:/src:z -v ./srpms:/srpms:z --tmpfs /tmp:size=4G fedlocstats:$release $script

where $script is one of the following:

Get the source packages

./build.py gets the SRPM lists, applies discovery, and computes progression stats

Detect languages

./build_language_list.py

For each package, produce progression stats.

Produce per package stats

./build_packages_stats.py

For each package, produce progression stats.

Produce global stats

./build_global_stats.py

Applies data cleanups and enhancements (cldr name).

Produce map

./build_map.py

Aggregate the data per language, then apply it to territories (this uses CLDR statistics on languages per territory).
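
As an illustration of this aggregation step, here is a minimal sketch that combines per-language progress with language-per-territory population figures. It assumes a population-weighted average, which may differ from what build_map.py actually computes; the sample rows, the numbers, and the function name `coverage_per_territory` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample: (territory, language, number of speakers), as CLDR-like rows.
cldr_rows = [
    ("BE", "nl", 6_000_000),
    ("BE", "fr", 4_000_000),
    ("FR", "fr", 60_000_000),
]

# Hypothetical per-language progress, in percent of translated words.
progress = {"fr": 80.0, "nl": 50.0}


def coverage_per_territory(rows, progress):
    """Population-weighted average of language progress for each territory."""
    weighted = defaultdict(float)
    population = defaultdict(float)
    for territory, language, speakers in rows:
        # Languages without any statistics count as 0% translated.
        weighted[territory] += progress.get(language, 0.0) * speakers
        population[territory] += speakers
    return {t: weighted[t] / population[t] for t in population if population[t] > 0}


result = coverage_per_territory(cldr_rows, progress)
print(result)
```

With this weighting, a territory where several languages are spoken gets a coverage value between those of its languages, pulled toward the most widely spoken one.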

Produce translation memories

./build_tm.py

Detect the list of languages. Aggregate all files for a language and produce a compendium, a terminology, and a translation memory.

Output files

0.error.language not in cldr.csv contains unknown languages (lines are removed)
0.error.languages is numeric.csv contains numeric languages (lines are removed)
0.error.lang with point.csv contains languages such as ".cp936" ".big5" (lines are removed)
0.error.len(language).csv contains languages with more than three characters (lines are removed)
0.error.len(territory).csv contains territories with more than two characters (lines are removed)
0.error.no population for this language-territory couple.csv contains the list of language-territory pairs for which no language statistics exist (no impact on results)
1.debug.lang.csv all lang (language + script + territory) values for debug (no impact on results)
1.debug.language.csv all lang values for debug (no impact on results)
1.debug.script.csv all script values for debug (no impact on results)
1.debug.territory.csv all territory values for debug (no impact on results)
1.debug.total message = 0.csv all lang values for debug (lines are removed)
3.result.csv full results per package with source filename and standardized language code, script code and territory code
4.0.cldr.csv language per territory as provided by CLDR
4.1.results_per_language.csv message and words progress percentages per language
4.1.results_per_language_ISO3.csv message and words progress percentages per language merged with "country code" database using ISO3166-1-Alpha-2 code
4.2.cldr_and_results_full.csv language per territory as provided by CLDR merged with message and words progress percentages per language
4.3.cldr_and_results_grouped.csv aggregation per territory of 4.2.cldr_and_results_full.csv, provides the territory, the number of languages, the population, the messages and words coverage.
4.4.world_stats.csv merge results of 4.3.cldr_and_results_grouped.csv with country database and geojson data.
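
The `0.error.*` files above describe a set of cleanup rules applied to language and territory codes. The sketch below shows one way such rules could be expressed. It is an assumption, not the repository's code: the function name `classify_lang`, the order of the checks, and the returned labels (taken from the file names) are illustrative, and the CLDR lookup is left out.

```python
from typing import Optional


def classify_lang(language: str, territory: str = "") -> Optional[str]:
    """Return the error bucket a code would fall into, or None if it passes."""
    if "." in language:
        # Encoding suffixes such as ".cp936" or ".big5"
        return "lang with point"
    if language.isdigit():
        return "languages is numeric"
    if len(language) > 3:
        return "len(language)"
    if len(territory) > 2:
        return "len(territory)"
    return None


for language, territory in [("pt", "BR"), (".big5", ""), ("123", ""), ("fren", ""), ("pt", "BRA")]:
    print(language, territory, classify_lang(language, territory))
```

Rows for which the function returns a label would be written to the matching error file and removed from the results; rows returning None continue through the pipeline.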

Information

Data in CLDR-raw folder comes from https://github.com/unicode-org/cldr/blob/master/common/main/en.xml

Ideas

CLDR supplementalData.xml: https://github.com/unicode-org/cldr/blob/master/common/supplemental/supplementalData.xml
use territoryContainment to build geographic groups
use languageData to detect default script
use languageData to have basic stats about territories
use territoryInfo to have advanced stats about territories
CLDR supplementalMetadata.xml: https://github.com/unicode-org/cldr/blob/master/common/supplemental/supplementalMetadata.xml
use the replacement values to harmonize content
CLDR likelySubtags.xml: https://github.com/unicode-org/cldr/blob/master/common/supplemental/likelySubtags.xml
use the replacements for advanced harmonization?
CLDR languageInfo.xml: https://github.com/unicode-org/cldr/blob/master/common/supplemental/languageInfo.xml
if a language is >= 90% close to another one, can we propagate translation statistics from one to the other?
CLDR languageGroup.xml: https://github.com/unicode-org/cldr/blob/master/common/supplemental/languageGroup.xml
what is it?
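
To make the likelySubtags idea above more concrete, here is a minimal sketch that reads `likelySubtag` entries and maps a bare language code to its likely full tag. The XML is a small hand-written excerpt in the format of likelySubtags.xml, not the real file, and the function name `load_likely_subtags` is hypothetical. How the result would be used for harmonization is left open, as in the list above.

```python
import xml.etree.ElementTree as ET

# Hand-written excerpt in the likelySubtags.xml format (not the full CLDR file).
SAMPLE = """<supplementalData>
  <likelySubtags>
    <likelySubtag from="fr" to="fr_Latn_FR"/>
    <likelySubtag from="sr" to="sr_Cyrl_RS"/>
    <likelySubtag from="zh" to="zh_Hans_CN"/>
  </likelySubtags>
</supplementalData>"""


def load_likely_subtags(xml_text):
    """Map each 'from' tag to its 'to' tag (language_Script_Territory)."""
    root = ET.fromstring(xml_text)
    return {
        node.get("from"): node.get("to")
        for node in root.iter("likelySubtag")
        if node.get("from") and node.get("to")
    }


likely = load_likely_subtags(SAMPLE)
language, script, territory = likely["sr"].split("_")
print(language, script, territory)
```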

automatic calculation (group by territory + spoken percentage * spoken )

create stats: number of countries with official language > 50% and related population

create stats: number of languages impacting more than one official language