Solving the Greatest Common Divisor Problem in Parallelx

advertisement
Solving the Greatest Common
Divisor Problem in Parallel
Derrick Coetzee
University of California, Berkeley
CS 273, Fall 2010, Prof. Satish Rao
The problem and serial solutions
• Given two positive integers (represented in binary),
find the large positive integer that divides both
• Let n will be the total number of bits in the inputs
• Solvable in O(n) divisions and O(n2) bit operations with
Euclidean algorithm:
– while b ≠ 0: (a, b) ← (b, a mod b)
• Matrix formulation:
• Schönhage uses this to solve in O(n log2n log log n)
Adding parallelism: Algorithm PM
• Brent & Kung (1982): reduce time of each step to
O(1) (O(n) bit complexity)
– α ← n; β ← n (where a,b ≤ 2n and a odd)
while b ≠ 0:
while b ≡ 0 (mod 2) { b ← b/2; β ← β − 1 }
if α ≥ β { swap(a,b); swap(α, β) }
if a+b ≡ 0 (mod 4): b ← ⌊(a+b)/2⌋
else: b ← ⌊(a−b)/2⌋
– Test “a+b ≡ 0 (mod 4)” uses only 2 lowest bits of sum,
enables effective pipelining of the steps
– Number of steps is still linear, requires n processors
Algorithm PM: Example
•
•
•
•
•
•
GCD(105, 30) (a = 11010012, b = 111102)
α ← 7; β ← 7
b ← 11112; β ← 6
swap (a ← 11112, b ← 11010012, α ← 6, β ← 7)
a + b ≡ 112 + 012 ≡ 002 (mod 4)
Compute trailing zeroes of new b and 2 more
bits:
– b ← ⌊(a+b)/2⌋ = (011112 + …010012)/2 = …1100
• b ← …112; β ← 5
• swap (a ← …112, b ← 11112, α ← 5, β ← 6)
Algorithm PM: Example
• a = …112, b = 11112, α = 5, β = 6
• a + b ≡ 112 + 112 ≡ 102 (mod 4)
• Compute LSB of new b while previous
iteration computes next bit of a in parallel:
– b' = ⌊(a−b)/2⌋ = …0
– a = …1112
• With another bit of a, can compute next bit of
b’
• Final result: a= 11112 , b’ = 0
Sublinear time GCD
• Kannan et al (1987): Break the problem into n/log
n steps, each of which takes log log n time and
eliminates log n bits of each input.
– CRCW, O(n log log n/log n) time, O(n2 log2n)
processors
– Idea: instead of a mod b, compute pa mod b for all
0≤p≤n
• If gcd(a,b) = d and 0 ≤ p ≤ n, then gcd(pa,b)/d ≤ n
• Pigeonhole principle gives first log n bits same for at least
two p
– gcd(a,b) = gcd(b,(p1a mod b)-(p2a mod b)) (ignoring factors ≤ n)
– Only need to know first log n bits to identify p1, p2 – can be
computed in log log n time
Sublinear time GCD: Example
• GCD(143, 221), n = 8
–
–
–
–
–
–
–
–
–
(0×143) mod 221 = 000…2
(1×143) mod 221 = 100…2
(2×143) mod 221 = 010…2
(3×143) mod 221 = 110…2
(4×143) mod 221 = 100…2
(5×143) mod 221 = 001…2
(6×143) mod 221 = 110…2
(7×143) mod 221 = 011…2
(8×143) mod 221 = 001…2
• (4×143) mod 221 − (1×143) mod 221 = 13
Sublinear time GCD: Using matrix form
• Problem: (p1a mod b) − (p2a mod b) takes log n
time
– Carry lookahead adder needs log n time to add n-bit
numbers
– Idea: Delay updates to (a, b) for log n steps, collecting
updates in a 2 × 2 matrix T, then compute T(a,b) in log
n time
– Results in n/log2n phases each taking log n time
– Each step performs constant number of log2n-bit
operations
• Ensures that each step still takes log log n time
– Time dominated by cost of the O(n/log n) steps
Getting rid of the log log n factor
• Chor and Goldreich (1990): Parallelize Brent & Kung’s
PM algorithm instead of Euclidean algorithm
– α ← n; β ← n
while b ≠ 0:
while b = 0 (mod 2) { b ← b/2; β ← β − 1 }
if α ≥ β { swap(a,b); swap(α, β) }
if a+b ≡ 0 (mod 4): b ← ⌊(a+b)/2⌋ else: b ← ⌊(a−b)/2⌋
– Idea: pack log n iterations of PM algorithm into each phase
• Each phase can be done in constant time
– Requires O(n/log n) time on O(n1+ε) processors
– Best-known algorithm
log n iterations in constant time
– α ← n; β ← n δ ← 0
while b ≠ 0:
while b = 0 (mod 2) { b ← b/2; β ← β − 1 δ ← δ + 1}
if α ≥ β δ ≥ 0 { swap(a,b); swap(α, β) δ ← −δ }
if a+b ≡ 0 (mod 4): b ← ⌊(a+b)/2⌋ else: b ← ⌊(a−b)/2⌋
– δ together with the last log n+1 bits of a and b determine
the transformation matrix of the next log n iterations
– Precompute a table – lookup, do the matrix multiplication
• Technicality: Need to treat |δ| ≤ log n and |δ| > log n differently
Multiplication in constant time
• How to multiply an n-bit number by a log n bit number
in constant time?
– Represent both numbers in base 2log n
– Precompute a multiplication table for this base
– Separate the larger number into a sum of two numbers,
each having zero for every other digit, e.g. 1234 = 1030 +
204
– Multiply both by the smaller number; no carries are
required
• Example: 1234 × 7 = (1030+204)×7 = 7210+1428 = 8638
– Finally, Chandra et al (1983) shows we can add two n-bit
numbers in O(1) time with O(n2) processors
• But this requires concurrent writes
Complexity perspective: an analogy
• P = NP?
– Open problem, but most practical problems have
either been shown to be in P or else NP-hard.
– Exceptions: integer factorization, graph isomorphism
• If integer factorization is NP-complete, NP = coNP
• If graph isomorphism is NP-complete, PH collapses to 2nd
level
• NC = P?
– Open problem, but most practical problems have
either been shown to be in NC or else P-hard.
– Exceptions: GCD
• What would GCD being P-complete imply?
Questions?
Download