>>> import math
>>>
>>> def collision_probability(k, bits):
...     n = 2 ** bits  # Total unique hash values based on the number of bits
...     probability = 1 - math.exp(-(k ** 2) / (2 * n))
...     return probability * 100  # Return as percentage
...
>>> # Example usage:
>>> k_values = [100000, 1000000, 10000000]
>>> bits = 44 # Number of bits for the hash
>>>
>>> for k in k_values:
... print(f"Probability of collision for {k} hashes with {bits} bits: {collision_probability(k, bits):.4f}%")
...
Probability of collision for 100000 hashes with 44 bits: 0.0284%
Probability of collision for 1000000 hashes with 44 bits: 2.8022%
Probability of collision for 10000000 hashes with 44 bits: 94.1701%
>>> bits = 48
>>> for k in k_values:
... print(f"Probability of collision for {k} hashes with {bits} bits: {collision_probability(k, bits):.4f}%")
...
Probability of collision for 100000 hashes with 48 bits: 0.0018%
Probability of collision for 1000000 hashes with 48 bits: 0.1775%
Probability of collision for 10000000 hashes with 48 bits: 16.2753%
>>> bits = 52
>>> for k in k_values:
... print(f"Probability of collision for {k} hashes with {bits} bits: {collision_probability(k, bits):.4f}%")
...
Probability of collision for 100000 hashes with 52 bits: 0.0001%
Probability of collision for 1000000 hashes with 52 bits: 0.0111%
Probability of collision for 10000000 hashes with 52 bits: 1.1041%
>>>
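Going the other way, the same birthday approximation (p ≈ 1 - exp(-k^2 / (2n))) can be inverted to ask how many bits we would need to keep the collision chance under some target. This is just a sketch continuing the session above; `bits_needed` and the 1% target are my own choices, not part of any spec:

>>> def bits_needed(k, target):
...     # Smallest hash space n with 1 - exp(-k**2 / (2 * n)) <= target,
...     # rounded up to whole bits
...     n = (k ** 2) / (-2 * math.log(1 - target))
...     return math.ceil(math.log2(n))
...
>>> for k in k_values:
...     print(f"Bits needed to keep {k} hashes under 1% collision risk: {bits_needed(k, 0.01)}")
...
Bits needed to keep 100000 hashes under 1% collision risk: 39
Bits needed to keep 1000000 hashes under 1% collision risk: 46
Bits needed to keep 10000000 hashes under 1% collision risk: 53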
If we adopted this scheme, we would have to increase the number of characters (_first N_) from 11 to 12, and eventually to 13, as the total number of Twts across the space grows. I _think_ the last full crawl/scrape found around ~500k (_maybe_)? https://search.twtxt.net/ says only ~99k.
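For reference, and assuming each prefix character carries 4 bits (i.e. a hex-encoded digest, which is how 11/12/13 characters line up with the 44/48/52-bit runs above), here is what those prefix lengths give at ~500k and 10M Twts, reusing `collision_probability` from the session above:

>>> BITS_PER_CHAR = 4  # assumption: hex digest prefix, 4 bits per character
>>> for chars in (11, 12, 13):
...     for k in (500_000, 10_000_000):
...         p = collision_probability(k, chars * BITS_PER_CHAR)
...         print(f"{chars} chars ({chars * BITS_PER_CHAR} bits), {k} Twts: {p:.4f}%")
...
11 chars (44 bits), 500000 Twts: 0.7080%
11 chars (44 bits), 10000000 Twts: 94.1701%
12 chars (48 bits), 500000 Twts: 0.0444%
12 chars (48 bits), 10000000 Twts: 16.2753%
13 chars (52 bits), 500000 Twts: 0.0028%
13 chars (52 bits), 10000000 Twts: 1.1041%

So at the current ~500k (or ~99k) scale even 11 characters keeps the risk well under 1%; 12 characters starts to look necessary somewhere past the ~1M mark, and 13 around ~10M.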